Transcript: Scary Drupal Migrations with Benji Fisher

This is an edited transcript. For the blog post and video, see Scary Drupal Migrations with Benji Fisher.

[00:00:00] Michael Meyers: Welcome to Tag1 TeamTalks, brought to you by Tag1 Consulting. I'm Michael Meyers, the Managing Director of Tag1, and we have two special Halloween episodes for you. In both episodes, we're going to be talking about platform migration stories. Scary cautionary tales to learn from, based on experiences that we've all had prior to joining Tag1.

In today's episode, I'm going to be chatting with Benji Fisher, a top contributor to the Drupal open source content management system. Among his many contributions, Benji is one of the core Migrate API maintainers, which is to say that Benji is responsible for your ability to migrate from other platforms onto Drupal and the ability to migrate from older versions of Drupal, including Drupal 7, onto the latest version of Drupal, which is something that tens of thousands of organizations are going through right now. When you've been part of a lot of large-scale migrations like Benji, you've undoubtedly run into challenging situations. And our hope is that in sharing some of these scary stories and talking about things that [00:01:00] went awry and challenges that we faced, you can better navigate your situation.

Today, Benji is going to share two implementation focused stories. Please keep an eye out for episode two in our Halloween series with Janez Urevc, another top contributor to Drupal, who's going to talk more about the human side of things and the challenges that organizations face when it comes to navigating the politics and staffing issues when changing tech stacks.

Hello and welcome to Tag1 Team Talks, podcast of Tag1 Consulting. We've got a special Halloween episode for you folks today.

[00:01:34] Michael Meyers: We're going to be telling scary stories about platform and data migrations. If you've ever been involved in a large scale platform or data migration project, things can get a little scary at times. We've got some great stories from our migration experts that we're going to talk through.

So welcome Benji. Thanks so much for joining me.

[00:01:53] Benji Fisher: Hi, great to be here.

[00:01:56] Michael Meyers: So we're going to tell two scary stories, one [00:02:00] that goes back in time a little bit.

For the first story, set the stage for me. When did this take place?

[00:02:10] Benji Fisher: So this was in 2019, and I was working as sort of a subcontractor, on a government website.

There are a couple of scary things about that. One is that I had to get a security clearance, so I had to go to my local police department and get fingerprinted and go through that rigmarole. On the other hand, a nice thing about that is that it's a matter of public record. The government websites are all public on GitHub.

I guess not all, but this one certainly is. And my contributions are a matter of public record.

[00:02:47] Michael Meyers: Were you worried about the background check?

This is a pre Tag1 story. Um, so, uh, a government website, a data migration. Tell me a little bit about the project.

[00:02:58] Benji Fisher: Yeah, it was not a [00:03:00] data migration in the sense that we had an old version of the website and we're building a new one. But there was a lot of data on the site and it could be entered manually through the Drupal UI.

But there were also requirements that we be able to import from and export to XML files in a pretty specific XML format. And that format was pretty complicated. So that was one of the scary things about the project, especially considering that, at the time I joined the project, they were already aware that it was on a tight schedule. Contract negotiations and stuff had dragged on longer than they should have, and we didn't have as much time as we really needed.

And we had a pretty sharp deadline. Government contracts frequently have that pretty firm deadline at the end of the fiscal year that they have to get completed by. And then the original plan for importing from [00:04:00] the XML files was to use the Feeds module, which kind of scared me. For one thing, in 2019, the Feeds module for Drupal 8 was still in alpha.

I do remember going back a little further than that. I live in the Boston area, so I've had the opportunity to participate in the Boston Drupal meetups. And early in my Drupal career when I was asking for advice, should I use feeds on this project or not? I pointed out at that point years ago feeds for Drupal 6 was still in beta, feeds for Drupal 7 was in alpha, should I use it anyway?

And I remember Moshe Weitzman was at the meetup and he said, it's the most stable alpha I've ever seen. So I have used Feeds. I do use Feeds when it's appropriate, and when it's appropriate, it is certainly easier than using the Migrate API. It is appropriate if you have something like an Atom feed [00:05:00] where you have a bunch of articles and each article has a title and a body field, one or two images, all pretty straightforward things to copy over.

Maybe it has a list of tags that you need to import. But when you start getting into complex dependencies, where you have to import a bunch of tiny pieces and then group them into larger pieces and then attach those, for that sort of complex dependency structure the Migrate API is much more appropriate than the Feeds module.

And that's the situation we had with this government site. The XML structure was really complicated. So the idea of using feeds for that was by itself pretty scary.

[00:05:49] Michael Meyers: I find XML to be scary on a good day. So you have a ridiculously tight deadline. It's a toss up between the alpha feeds. I [00:06:00] love that statement from Moshe.

The most stable alpha I've seen. I'm not sure what that means.

[00:06:05] Benji Fisher: It means it was usable. It's not crazy to use it on a production site.

[00:06:13] Michael Meyers: Not too crazy. We're big fans of pushing technology, developing technology. The Migrate API had to be pretty new at that point, right? Six to seven. Was that pretty stable?

[00:06:24] Benji Fisher: Oh, by 2019? Drupal 8 had been out for a few years at that point. So the Migrate module for Drupal 7 was rock solid.

I think the contrib Migrate module started out in Drupal 5, and Moshe and Mike Ryan were the ones who started it. The Migrate API in Drupal 8 was fairly stable. It never managed to meet all of the hopes that people had for it. It was always intended as the upgrade path from [00:07:00] Drupal 7.

And initially we hoped that it would be more stable than it ever got to be. It still has problems with multilingual migrations, with lots of contrib modules, and with revisions and such. There are still some cases that are really hard to migrate with the API, but the framework is in place, which is what you need.

Again, this was not a site migration. This was data import. And that was really quite stable by 2019. That's one of the two reasons that I thought it was the appropriate tool for the job.

[00:07:38] Michael Meyers: So how did it go?

[00:07:40] Benji Fisher: There is a way to tame XML. A lot of people take your point of view and just shy away from it.

But the way to tame it is a related language called XPath, which is a way of specifying parts of an XML file. It's analogous to the CSS selectors [00:08:00] that jQuery uses, but it's much more powerful. It's incredibly expressive, and that's what we use when we are importing something from an XML source using the Migrate API.

On this project, I got a real appreciation for just how flexible XPath is. And I just checked. Again, this is a government project. It's on a public GitHub repo. I can go check this work that I did four years ago. One of the XPath expressions is 236 characters long.

It's complicated, and it's in a file that's 3,941 lines long. So I call that pretty scary.
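To illustrate the comparison with CSS selectors, here is a minimal Python sketch of XPath selecting parts of an XML document. The sample XML and every element and attribute name in it are hypothetical; the real government format is not shown in the talk. Python's `xml.etree.ElementTree` supports only a subset of XPath, so this shows the flavor of the language rather than its full power.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real XML format is not shown in the talk.
SAMPLE = """\
<submission>
  <section id="5A">
    <item kind="contact"><name>Ada</name><phone type="work">555-0100</phone></item>
    <item kind="contact"><name>Grace</name><phone type="home">555-0199</phone></item>
  </section>
  <section id="5B">
    <item kind="note"><name>Misc</name></item>
  </section>
</submission>
"""


def select_text(root, path):
    """Return the text of every element matched by an XPath expression."""
    return [element.text for element in root.findall(path)]


root = ET.fromstring(SAMPLE)

# Comparable to the CSS selector "item > name": every <name> directly inside an <item>.
all_names = select_text(root, ".//item/name")

# Predicates can filter on attributes at each step of the path.
work_phones = select_text(
    root, "./section[@id='5A']/item[@kind='contact']/phone[@type='work']"
)

print(all_names)
print(work_phones)
```

Even in this small case, the second expression filters at three levels of the tree, which hints at how a real expression can grow to a couple of hundred characters.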

[00:08:52] Michael Meyers: Was that auto generated?

[00:08:54] Benji Fisher: No, this was written by hand. And [00:09:00] the YAML file is pretty simple. Most of those 3,900 lines are a name (the field name that we're importing to), a label (some text thing), and a selector (an XPath expression).

That repeats over and over again. So there have got to be something like 800 to 1,000 of those little groups. So the structure of the YAML file is very simple, but there's a lot of it because there were a lot of different fields on this content type that we were migrating into.
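The repeated name, label, and selector groups can be pictured with a short Python sketch. This is not the project's YAML file or Drupal's Migrate code; it only mirrors the shape described above, and the field names, labels, selectors, and sample XML are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical field groups: a destination field name, a human-readable label,
# and an XPath selector evaluated relative to each imported item.
FIELDS = [
    {"name": "applicant_name", "label": "Applicant name", "selector": "applicant/name"},
    {"name": "applicant_city", "label": "Applicant city", "selector": "applicant/address/city"},
    {"name": "section_5a_total", "label": "Section 5A total", "selector": "section[@id='5A']/total"},
]

# Hypothetical source data with two items to import.
SAMPLE = """\
<records>
  <record>
    <applicant><name>Ada</name><address><city>Boston</city></address></applicant>
    <section id="5A"><total>12</total></section>
  </record>
  <record>
    <applicant><name>Grace</name></applicant>
    <section id="5B"><total>7</total></section>
  </record>
</records>
"""


def extract_row(item, fields):
    """Apply every field's selector to one item; missing values become None."""
    row = {}
    for field in fields:
        match = item.find(field["selector"])
        row[field["name"]] = match.text if match is not None else None
    return row


rows = [extract_row(record, FIELDS) for record in ET.fromstring(SAMPLE).findall("record")]
for row in rows:
    print(row)
```

With hundreds of fields, the list of groups gets long while the code that consumes it stays this small, which matches the description of a simple structure with a lot of entries.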

[00:09:38] Michael Meyers: That sounds like a nightmare to manage, nightmare to create, a nightmare to manage.

That's amazing.

[00:09:43] Benji Fisher: I feel a little guilty about some of these XPath expressions in there. I can't imagine that anyone has fun if that file ever needs to be updated. Adding a comment line here and there, that's no problem. If there's a new [00:10:00] field that has to be filled out,

it's

[00:10:05] Michael Meyers: Of those 3,941 lines, are any documentation or comments?

[00:10:10] Benji Fisher: A few of them, just one-line comments for grouping: here begins Section 5A of the import form, and here are the fields for Section 5A.

[00:10:21] Michael Meyers: And did you guys meet the deadline?

[00:10:23] Benji Fisher: I believe we did. As far as I recall, we launched the site, it needed extra resources to handle the admin UI. As I said, you could import data from an XML file or you could enter it manually. When you're entering it manually, you had hundreds of fields to fill out, and that meant it was a resource hog to edit something in the admin UI. And I mostly focused on the import and export of XML, but other people on the team were [00:11:00] trying to deal with the user experience and the performance problems created by that structure.

[00:11:05] Michael Meyers: Wow.

Looking back, would you have done anything differently? Besides the contract,

[00:11:13] Benji Fisher: I didn't have any control over that. If we had more time to explore our options, I would have thought about using Drupal's normalization and serialization APIs instead of the Migrate API. I've never worked with those APIs, so I don't really know.

I've spent some time exploring, but I wonder whether that might have been a more manageable, better structured way of representing it than the 4,000-line YAML file.

[00:11:48] Michael Meyers: That is one of the craziest things I've heard. It definitely qualifies as a scary story.

So let's hop back in [00:12:00] time.

You have another scary story to share with us involving an LMS learning management system.

[00:12:06] Benji Fisher: That's right. So this was in 2013 or 2014. It was my second professional job doing Drupal. And yeah we built and maintained this LMS. It had a bunch of activities that students would go through and those were all written in some JavaScript package.

I seem to recall the package was called Storybook, but I might be misremembering that. At any rate, it's not the Storybook that we use these days for component-driven design, but it was some JavaScript package, and we had these activities. And if you think back to 2014, the web has moved a lot since then.

Most sites then were not encrypted. They were running under HTTP, and certainly ours was. And if the idea of entering your [00:13:00] password into an unencrypted site isn't scary, I don't know what is. I think also this was about the time that Google was starting to push in the right direction, saying we're going to change your search relevance based on whether or not you're encrypted. Remember that?

[00:13:18] Michael Meyers: I think it's so funny that you were saying that, because at first it was so hard to recall that websites didn't have it. And then it hit me. I remember when Google started prioritizing the search results as an encouragement to get people to switch over.

And it really wasn't that long ago. That is scary. That's crazy.

[00:13:39] Benji Fisher: So anyway we thought it was a good idea to encrypt everything. And the first job was to convince the client because the client somehow believed that it was going to be a huge performance hit. That the encryption, decryption, and [00:14:00] the HTTPS protocol was going to add significant time.

Especially, they said, we've got our servers in the U.S., and if you're connecting to the site from the U.S., which we were (those of us who were developing it were all in the U.S.), there's less latency between a U.S. web browser and a U.S. server. And they said, but we've done some testing from our server in France, in Europe, and with the transatlantic connection and yada yada, and the protocol and the handshakes, it's adding a significant amount of time to delivering these Drupal pages.

And I knew this was bogus. Back in the 1990s, when a web server was reading files off a disk and shipping them off over HTTP, the overhead of encrypting it was significant compared to [00:15:00] the load on the web server. That is not where we were. We were running a Drupal site.

We were not caching because almost all of our traffic was authenticated. People were logged onto the site. They were getting their own personal results. So Drupal was building every page on every request. It was doing database queries. It was running PHP to generate pages. And then it was sending them over the wire.

And in that context, the overhead of encryption is pretty negligible. Now, I am not a performance engineer. I certainly defer to other people at Tag1 on questions of this. I cannot tell you how many microseconds it adds to a request, but I know now, and I knew back in 2014, that it was pretty minimal compared to building up a page in Drupal.

And so my first suggestion was that they fly me out to Paris so I could [00:16:00] poke at the server myself and figure out why it was taking so long. And somehow they didn't like that idea. But they did give me SSH access. So from the U.S. I could connect remotely to their server, and then from their server I could request pages over HTTP or over HTTPS.

There were three of us sitting in a room doing this, and we were using Wget from the command line. I still use Wget. I know that cURL is more popular with the in crowd these days, but I am used to Wget.

And the nice thing about Wget is that it's a GNU command-line tool. It has a ton of options. And in particular, you can inspect the server response. And so we were requesting it, looking at the server response, and finally someone in the room said, hey, wait a minute. This server response is telling me that when I requested over HTTP, I'm getting a [00:17:00] locally cached version, but not when it's over HTTPS.
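A comparable check can be sketched in Python. The talk does not say which headers revealed the cached copy, so the header list below is an assumption: `Age` and `Via` are standard HTTP headers that intermediaries commonly add, and `X-Cache` is a widespread but non-standard one. The helper names are hypothetical. With Wget itself, the `-S` (`--server-response`) option prints the response headers.

```python
from urllib.request import urlopen

# Assumed indicators of an intermediary cache; the headers seen in the story are not known.
CACHE_HEADERS = ("age", "via", "x-cache")


def cache_hints(headers):
    """Return the cache-related headers (lower-cased names) found in a response."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name] for name in CACHE_HEADERS if name in lowered}


def fetch_headers(url, timeout=10):
    """Request a URL and return its response headers as a plain dict."""
    with urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())


def compare_schemes(host_and_path):
    """Fetch the same resource over HTTP and HTTPS and report cache hints for each."""
    return {
        scheme: cache_hints(fetch_headers(f"{scheme}://{host_and_path}"))
        for scheme in ("http", "https")
    }
```

Calling `compare_schemes("example.com/")` from the machine under test would show whether only the plain HTTP response carries cache hints, which is the pattern described here.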

And of course, that makes sense. If you're requesting something over HTTP, it's unauthenticated. The response comes back with some TTL information: how long this response is valid. And the lazy French server doesn't want to request it again. It'll say, okay, I'm going to cache this for as long as the TTL says.

And if someone makes the same request, I'm going to give them the same answer. But you can't do that with an encrypted response, because you don't know that it's going to be the same response the next time, and you don't even know what it is. The server can't decrypt it. Only the browser, or Wget, at the end knows how to decrypt it.

So that explained why it was so much faster to get the HTTP results than to get the encrypted results. So that was the first step: we had to [00:18:00] convince the client that HTTPS was not going to kill their performance. One of the suggestions that was made was, oh, let's just encrypt the pages where they enter their password. And there is a Drupal module that does this. It's the Secure Pages module.

[00:18:18] Michael Meyers: That only encrypts the page you log in on, but nothing else.

[00:18:22] Benji Fisher: You get to specify which pages get encrypted. So you could encrypt every admin page plus the login page and whatever else you want, but other pages won't be encrypted.

And so it's going to pass you off from HTTPS to HTTP, depending on which page you're looking at. And it requires a couple of patches to Drupal core before it'll work. And you should not use this module. I'm sorry.

[00:18:51] Michael Meyers: I'm sure it doesn't exist anymore, I hope.

[00:18:53] Benji Fisher: Oh, I'm afraid it does. And it's still used by a few thousand sites.

I just checked [00:19:00] earlier today. But, don't use it. Speaking as a member of the Drupal security team, don't use this module. It is a scary module.

[00:19:11] Michael Meyers: Wow. That'll be a follow-up episode. Scary modules. You just shouldn't.

[00:19:19] Benji Fisher: So then it came time to implement it, and we bought the SSL certificate. Remember, this was long before Let's Encrypt gave us free certificates. We installed it on the servers in the Apache config and we tested it on the stage site, but not well enough. And then we went live. We flipped the switch.

All traffic was going over HTTPS. And the site broke. Specifically, remember I mentioned that this was an LMS and we had all of these activities written in some JavaScript package? Every time you finished one step in an activity and went on to the next [00:20:00] step, it was following a URL. And that URL was saved in this compressed JavaScript file.

And of course, all of those hard-coded URLs were using the HTTP paths.

And when you were completing the step, the server was forcing HTTPS.

Yeah. And it's not a situation where you can just follow a redirect, because it's in a JavaScript call and the server isn't going to trust it.

I forget all the technical details, but we had to update those URLs. I checked with the people who worked with the package and asked how much trouble it would be to go into a hundred of these activities, edit these completion URLs, changing them from HTTP to HTTPS, save them, and export them, [00:21:00] so that we could upload them to the server.

And it was going to take days. Okay, that's not an option. I looked around and I found a command-line tool that would convert this custom format into an XML file. And I could install that on Ubuntu. The servers were all Ubuntu. So just apt-get install this package. And so I converted them to XML. I had a little sed script to replace the URLs in the appropriate element.

And then the command-line tool would convert it back into the custom format. I wrote a shell script that would first save a backup of the file and then do those steps that I mentioned: convert, sed, convert back, save. And within a few hours, I had the problem fixed.
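The backup, convert, replace, convert back sequence can be sketched in Python rather than shell and sed. The converter tool is not named in the talk, so the two conversion steps are passed in as hypothetical callables (`to_xml` and `from_xml`), and the element name `completion_url` is likewise an assumption made for illustration.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def upgrade_urls(xml_text, element_name="completion_url"):
    """Rewrite http:// to https:// in the text of the named element only."""
    root = ET.fromstring(xml_text)
    for element in root.iter(element_name):
        if element.text and element.text.startswith("http://"):
            element.text = "https://" + element.text[len("http://"):]
    return ET.tostring(root, encoding="unicode")


def fix_activity(path, to_xml, from_xml):
    """Back up one activity file, then convert, rewrite URLs, convert back, and save.

    to_xml and from_xml stand in for the unnamed command-line converter:
    to_xml(bytes) -> str of XML, from_xml(str of XML) -> bytes in the custom format.
    """
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # save a backup before touching the file
    xml_text = to_xml(path.read_bytes())
    path.write_bytes(from_xml(upgrade_urls(xml_text)))
    return backup
```

Restricting the replacement to one element, as the sed script in the story did, avoids rewriting URLs that merely appear in activity text. Looping `fix_activity` over a directory of files would cover all hundred activities in one run.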

And, oh yes, the other scary thing I forgot to mention, is that it just so [00:22:00] happened that several muckety mucks at the client were visiting our offices that day from Germany, right?

They come over, they were talking with the people, the project managers, and so on and so forth. And that was the day that we switched over to HTTPS and the whole site broke. So no pressure, Benji, but it would be good if you could solve this by lunchtime.

[00:22:24] Michael Meyers: I'll point out that both your scary stories involve XML.

Coincidence?

[00:22:31] Benji Fisher: I noticed that too, absolutely.

[00:22:34] Michael Meyers: I love that you wrote a shell script that is converting things into other things and back again. But it's ingenious. It worked, and very quickly.

[00:22:46] Benji Fisher: Oh, yeah

[00:22:47] Michael Meyers: Man, it blows my mind. It's hard to remember that things weren't encrypted.

And that is super scary. Today it's extremely rare to find a site that isn't encrypted. [00:23:00]

[00:23:00] Benji Fisher: Yeah, there are a few lying around that haven't been touched in ten years. But anything current, yeah, it just, it's part of the deal.

[00:23:11] Michael Meyers: And you got this site back online before lunch?

[00:23:14] Benji Fisher: Before I ate lunch, that's for sure.

[00:23:21] Michael Meyers: Wow, that was awesome. Thank you so much for sharing your scary stories for our Halloween episode. I'm sure it wasn't fun at the time, but it's fun to look back and laugh at the craziness that we've dealt with across some different projects. So thanks for sharing and

I look forward to having you back soon. Thank you. Take care, Benji. Take care, Mike.

A huge thank you to Benji Fisher for sharing his scary stories from his time before joining Tag1. Make sure to check out our other Halloween episode with Janez Urevc, another top contributor to Drupal, who's going to be talking more about the human side of things and the [00:24:00] challenges organizations face when it comes to navigating the politics and staffing issues of changing tech stacks.

If you liked this talk, please remember to upvote, subscribe and share it out. Check out past team talks at tag1.com slash TTT. That's three T's for Tag1 Team Talks. As always, we'd love your feedback, any topic suggestions and input. You can write to us at TTT at tag1.com. That's T A G, the number one dot com.

A big thank you to our guest, Benji. For everyone who tuned in, thanks for joining us.