Transcript: Unraveling the ETL Data Migration Process : Understanding Load

This is an edited transcript. For the blog post and video, see Unraveling the ETL Data Migration Process- Part 3: Understanding Load.

[00:00:05] Janez Urevc: Welcome to Tag1 Team Talks, brought to you by Tag1 Consulting. With Drupal 7 and Drupal 9 rapidly approaching end of life, we are hearing people talk about migrating and upgrading more than ever before. And anyone who's ever been involved with a large scale migration, migrating a large site or application from one technology stack to another, will tell you that it's complex, time consuming, and it demands expertise.

[00:00:34] Janez Urevc: That's why we are bringing you this series of talks, diving deep into the world of Drupal migrations, and who better to guide us than Tag1's very own Drupal migration experts. From the masterminds and maintainers of Drupal's migration tooling, to the individuals behind the most groundbreaking Drupal migrations, we've got an all star lineup who'll cover everything you need to know about every aspect [00:01:00] of migrating large scale applications.

[00:01:04] Janez Urevc: This team talk is part of the three part series about ETL, Extract, Transform, and Load process, which is used by many enterprise migration systems, Drupal's Migrate included. In today's episode, we're going to talk about how to use Drupal's Migrate system to Load data into the destination system, which is usually Drupal, but not necessarily always.

[00:01:29] Janez Urevc: Be sure to stick around to the end because we're, uh, also going to announce the next few talks in our series. Let's dive in. I'm Janez Urevc, senior engineer here at Tag1 and a long time contributor to Drupal, and I'm joined today by well known top contributors to Drupal, Benji Fisher, one of the five current Drupal Migrate core subsystem maintainers, and Mike Ryan, co creator of Migrate.

[00:01:55] Janez Urevc: Welcome, and thank you for joining me.

[00:01:58] Janez Urevc: In case you didn't already [00:02:00] watch or listen to the previous two episodes in this series about E, Extract, and T, Transform, I'd suggest that you do so. In the first episode, we, among other things, provided a high level overview of what ETL stands for. So, we're not going to cover this today, and we will dive directly into today's topic, which is...

[00:02:25] Janez Urevc: L, Load. So Benji, could you tell me, and our audience, of course, what is being done as part of the Load process and how specifically that is done in Drupal Migrate?

[00:02:43] Benji Fisher: Sure, so almost all of the time, um, the load phase is going to be creating entities. Um, Drupal is structured with, um, you know, a fairly consistent content model.

[00:02:57] Benji Fisher: So we can have [00:03:00] configuration entities and we can have content entities. So content entities are things like taxonomy terms, nodes, users. Um, and, and we'll be creating each of those in a separate migration. Um, each migration should have just a, a single destination type, so you'll want separate migrations in the same project, one, one for each type of entity you're creating.

[00:03:27] Benji Fisher: Um, the configuration entities are often settings. Um, but, uh, but blocks, for example, each block is a configuration entity, and you could have one, one migration in your project to create block entities. Um, but there are other things. Those are the most common ones, but, um, the entire migration system is [00:04:00] pluggable. In the load phase we have destination plugins, and there are alternatives to, um, to entities. You might be creating a custom database table, um, and there's, uh, I think one example we'll be talking about later where we're actually migrating into the [00:04:23] Drupal state system, which, uh, if you drill down a little bit, turns out to be the key value table in Drupal.

[00:04:37] Janez Urevc: One would think that, you know, Drupal Migrate would only need one destination for its migrations, which is Drupal. Um, but Mike, you, when you were designing Migrate, you decided to make it pluggable in general, but also make destinations pluggable. Um, can you talk about [00:05:00] the reasoning, like why, why you left the door open to, to run migrations that, that store data into anything, basically?

[00:05:12] Mike Ryan: Well, um, well, the main thing was that at the time we originally developed Migrate, which was Drupal 6, uh, basically every module that managed content in Drupal had its own database table, its own schema. There was no general purpose entity system. So basically each type of data in Drupal needed its own, uh, destination plugin.

[00:05:43] Mike Ryan: Um, and so, um, it was just natural to use the same sort of plugin system as we're using for the extractor. Um, and of course, once you've [00:06:00] got that flexibility, you can start to think of other ways to use it. For example, you can have a CSV, uh, destination plugin to export data using the migration system. If you want to pull, um, pull Drupal data out into, uh, a format importable to something else, you can write a migration that extracts your Drupal data, transforms it into the, um, proper, uh, format, and then a loader that dumps it.
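
The pluggable-destination idea Mike describes can be illustrated with a small sketch. This is Python rather than Drupal's PHP, and it is not Migrate's actual API: the names `Destination`, `CsvDestination`, `KeyValueDestination`, and `run_migration` are all hypothetical. It only shows that the same extract and transform steps can feed a CSV writer or a key-value store when the load step sits behind a common interface.

```python
import csv
import io
from typing import Protocol


class Destination(Protocol):
    """Hypothetical interface: anything that accepts one transformed row."""

    def import_row(self, row: dict) -> None: ...


class CsvDestination:
    """Hypothetical destination that writes each row as one CSV line."""

    def __init__(self, stream, fieldnames):
        self.writer = csv.DictWriter(
            stream, fieldnames=fieldnames, lineterminator="\n"
        )
        self.writer.writeheader()

    def import_row(self, row: dict) -> None:
        self.writer.writerow(row)


class KeyValueDestination:
    """Hypothetical destination that stores each row under a key."""

    def __init__(self, key_field):
        self.key_field = key_field
        self.store = {}

    def import_row(self, row: dict) -> None:
        self.store[row[self.key_field]] = row


def run_migration(source_rows, transform, destination: Destination) -> int:
    """Extract, transform, and load one row at a time; return the row count."""
    count = 0
    for row in source_rows:
        destination.import_row(transform(row))
        count += 1
    return count


# Made-up source data and a trivial transform, for illustration only.
source = [{"nid": 1, "title": " Hello "}, {"nid": 2, "title": "World"}]


def clean(row):
    return {"id": row["nid"], "title": row["title"].strip()}


# Same source and transform, two different destinations.
buffer = io.StringIO()
run_migration(source, clean, CsvDestination(buffer, ["id", "title"]))
csv_output = buffer.getvalue()

kv = KeyValueDestination("id")
run_migration(source, clean, kv)
```

Swapping the destination object is the only change between the two runs, which is the property the discussion attributes to destination plugins.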

[00:06:41] Janez Urevc: And even, even inside Drupal now we have, as far as I know, different classes for different entity types, right? Like different destination classes for nodes, for taxonomy terms, right?

[00:06:56] Mike Ryan: There is a, you know, they're built on a [00:07:00] general, um, entity destination, but many, um, there is, of course, a big difference between content entities and configuration entities, but also even among those.

[00:07:16] Mike Ryan: A number of the, um, of entity types need a little bit of special handling. Uh, for example, users, you have to deal with, uh, passwords.

[00:07:29] Benji Fisher: So the person writing the migration doesn't necessarily know whether there's any special processing. They know they're creating... entities of type user. So they say content entity, colon user. And if there is special processing, there'll be a special, um, load class to, to manage that, and if not, it'll just fall back to the generic content entity of type user.
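
The fallback Benji describes can be sketched as a lookup with a default. This is an illustrative Python sketch, not Drupal's implementation: the class names, the `SPECIALIZED` registry, the `resolve_destination` helper, and the `pass_needs_rehash` flag are all hypothetical, and the `entity:user` string format is assumed from the "type, colon, name" form mentioned above.

```python
class EntityDestination:
    """Hypothetical generic destination: no special handling per entity type."""

    def __init__(self, entity_type):
        self.entity_type = entity_type

    def prepare(self, values):
        # Generic case: pass the values through unchanged.
        return dict(values)


class UserDestination(EntityDestination):
    """Hypothetical specialized destination for users."""

    def prepare(self, values):
        prepared = dict(values)
        if "pass" in prepared:
            # Placeholder for whatever password handling a real system needs;
            # here we only flag the row so the difference is visible.
            prepared["pass_needs_rehash"] = True
        return prepared


# Only entity types that need special handling are registered (assumption).
SPECIALIZED = {"user": UserDestination}


def resolve_destination(plugin_id):
    """Pick a specialized class if one exists, else the generic one."""
    base, _, entity_type = plugin_id.partition(":")
    if base != "entity" or not entity_type:
        raise ValueError(f"Unsupported destination: {plugin_id!r}")
    destination_class = SPECIALIZED.get(entity_type, EntityDestination)
    return destination_class(entity_type)


user_destination = resolve_destination("entity:user")
term_destination = resolve_destination("entity:taxonomy_term")
```

The person writing the migration only names the entity type; whether a specialized class exists is resolved behind the scenes.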

[00:07:56] Mike Ryan: Right. And, um, we, we haven't touched [00:08:00] on that so far, but perhaps we should, that most of your migration logic, such as it is, will be implemented in simple YAML files. Basically, migrations are configuration. Um, if the existing plugins serve the needs for your particular, uh, migration application, then basically you just write a bunch of YAML, and maybe in serious migrations, you often need to write an occasional, um, transformer in PHP, but most of your work is just YAML, so it's very readable.

[00:08:52] Mike Ryan: Very simple to put together.

[00:08:56] Mike Ryan: That isn't really part of the load talk. That should, but we should [00:09:00] cover it somewhere.

[00:09:02] Janez Urevc: Um, when we were preparing for this episode, you also mentioned that, um, like the fact that we are, when we are storing entities, we do one entity at a time, and there are very good reasons for that. Can you also talk about that a little bit?

[00:09:24] Mike Ryan: Well, the whole pipeline, the, uh, handles one entity or one logical piece of data at a time. And, um, there are multiple reasons for that. One big one is to handle references between, uh, entities, let's say, if you have a link from one node to another or a link from a node to a taxonomy term. And [00:10:00] historically these, um, the unique identifiers for Drupal entities have been serial, uh, fields, um, serial numbers.

[00:10:15] Mike Ryan: And when you're creating new entities on your new system, it's, you're, you're going to end up with new numbers. If you really, really insist on it and work hard at it, you may be able to preserve your IDs, but this is, it's not recommended. It's much simpler to, um, rewrite the references, and to rewrite the references, you can't do it in bulk because you won't know the new reference number.

[00:10:41] Mike Ryan: The pipeline does it one at a time so that you migrate one entity, it gets its new number, we keep [00:11:00] track of the mapping from its old ID to its new ID, and then when it's time to migrate the reference, we can fill in the new ID and everything is still pointing where it's supposed to be.
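
The ID mapping Mike describes can be sketched in a few lines. This is a simplified Python illustration, not Migrate's actual map tables or API: `IdMap`, `FakeStorage`, the migration names, and the sample data are all hypothetical. It shows why rows are handled one at a time: each new ID only exists after its row has been saved, and later rows look it up to rewrite their references.

```python
class IdMap:
    """Hypothetical record of old ID -> new ID, per migration."""

    def __init__(self):
        self._map = {}

    def save(self, migration, source_id, destination_id):
        self._map[(migration, source_id)] = destination_id

    def lookup(self, migration, source_id):
        return self._map.get((migration, source_id))


class FakeStorage:
    """Stands in for a destination that assigns serial IDs on save."""

    def __init__(self, first_id):
        self.next_id = first_id
        self.rows = {}

    def create(self, values):
        new_id = self.next_id
        self.next_id += 1
        self.rows[new_id] = values
        return new_id


id_map = IdMap()
terms = FakeStorage(first_id=100)
nodes = FakeStorage(first_id=500)

# Made-up source data: node 3 references old term 9.
old_terms = [{"tid": 7, "name": "News"}, {"tid": 9, "name": "Events"}]
old_nodes = [{"nid": 3, "title": "Launch", "term": 9}]

# Terms first: each one gets a new ID, and the mapping is recorded.
for term in old_terms:
    new_tid = terms.create({"name": term["name"]})
    id_map.save("terms", term["tid"], new_tid)

# Nodes second: the old term reference is rewritten through the map.
for node in old_nodes:
    new_term = id_map.lookup("terms", node["term"])
    new_nid = nodes.create({"title": node["title"], "term": new_term})
    id_map.save("nodes", node["nid"], new_nid)
```

In this sketch the migrated node ends up pointing at the new term ID, even though neither new ID matches the old numbering.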

[00:11:13] Mike Ryan: Um, you know, we will talk, we talked more about this in the Transform talk before this. And another, I'm sorry. Yeah, go ahead Benji.

[00:11:30] Benji Fisher: Yeah, another reason to do it one entity at a time is that we want to leverage the other APIs that Drupal provides. The Entity API does not give us a method for creating 10 entities at a time.

[00:11:44] Benji Fisher: It gives us methods for creating one entity at a time. So just to manage the complexity of the Migrate API in Drupal core, that's a second reason for, for doing it one at a time. Now, in [00:12:00] particular cases, um, if you've got a huge number of things that you're creating and you're, uh, you know, that your migration is going to take hours or days.

[00:12:11] Benji Fisher: Um, and you know that this particular part of your project isn't going to require the sort of references that Mike was talking about, then on a particular project, it might make sense to have, um, some, some custom code, a custom destination plugin, for example, that does batch things to 10 or 100 at a time.

[00:12:33] Benji Fisher: Um, but that's not going to go into Drupal core. Because it won't always work and it would be a lot of added complexity.
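
A project-specific batching destination of the kind Benji mentions might look roughly like the sketch below. It is written in Python for illustration and does not use any Drupal API: `BatchingDestination` and the `write_many` callback are hypothetical. It also assumes, as stated above, that no later row needs the new ID of an earlier row, since IDs are not available until a batch is flushed.

```python
class BatchingDestination:
    """Hypothetical destination that buffers rows and writes them in batches."""

    def __init__(self, write_many, batch_size=100):
        self._write_many = write_many  # callable that persists a list of rows
        self.batch_size = batch_size
        self._buffer = []

    def import_row(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write whatever is buffered; must be called once more at the end."""
        if self._buffer:
            self._write_many(list(self._buffer))
            self._buffer.clear()


# Stand-in for a bulk insert: just remember each batch that was written.
written_batches = []
destination = BatchingDestination(written_batches.append, batch_size=10)

for i in range(25):
    destination.import_row({"id": i})
destination.flush()  # write the final partial batch

batch_sizes = [len(batch) for batch in written_batches]
```

The final `flush()` call is the kind of extra bookkeeping that makes this approach harder to offer as a general-purpose feature.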

[00:12:43] Mike Ryan: And, uh, one other reason to deal with one entity at a time is, uh, memory. If you've got a lot of data, you don't want to deal with the whole batch of data at once, [00:13:00] and we are, uh, the, the migrate system is very performance conscious. It's got some built in memory, um, uh, sort of, uh, I'm not sure what you would call it, but it, it would, it will recognize if you're running low on memory.

[00:13:20] Mike Ryan: And, um, do some purging of internal caches and so on as needed to keep going. And if you're using, um, Drush to run your migrations as you should, it can, if necessary, um, respawn a new process, fresh process, if the, uh, if it's unable to reclaim enough memory.
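
The behavior Mike describes (notice that memory is running low, purge caches, and start a fresh process only if that did not help) can be sketched as a simple decision function. This Python sketch is an assumption-laden illustration, not Migrate's or Drush's code: `check_memory`, `FakeProcess`, the 0.85 threshold, and the numbers are all hypothetical.

```python
def check_memory(usage, limit, reclaim, threshold=0.85):
    """Return 'ok', 'reclaimed', or 'restart' for the current memory state.

    usage:   callable returning current memory use
    limit:   the memory limit, in the same unit
    reclaim: callable that purges caches to free memory
    """
    if usage() < limit * threshold:
        return "ok"
    reclaim()
    if usage() < limit * threshold:
        return "reclaimed"
    # Purging was not enough: the caller should continue in a fresh process.
    return "restart"


class FakeProcess:
    """Stand-in for a running migration with some purgeable cache."""

    def __init__(self, used, cache):
        self.used = used
        self.cache = cache

    def usage(self):
        return self.used

    def purge_caches(self):
        self.used -= self.cache
        self.cache = 0


limit = 1000
light = FakeProcess(used=500, cache=100)        # well under the threshold
heavy_cache = FakeProcess(used=900, cache=300)  # purging frees enough
leaky = FakeProcess(used=950, cache=20)         # purging is not enough

results = [
    check_memory(p.usage, limit, p.purge_caches)
    for p in (light, heavy_cache, leaky)
]
```

Checking between rows like this only works because the pipeline handles one row at a time, so there is always a safe point to pause or hand over to a new process.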

[00:13:46] Janez Urevc: So Migrate will do that automatically, I mean Drush and Migrate will do that automatically behind the scenes without developer initiating it?

[00:13:56] Janez Urevc: Yes. That's great. [00:14:00]

[00:14:00] Janez Urevc: We are planning to have a talk on performance and I'm sure that we will talk about these sorts of things, uh, in detail in that episode. Um, Benji already mentioned core versus contrib. So what, what do we have in core in terms of, uh, the load step, and which interesting other things could we find in contrib space?

[00:14:33] Benji Fisher: So core has, uh, support for migrating from Drupal 6 or Drupal 7 into modern Drupal, and so mostly that means entities. So nodes, taxonomy terms, users, blocks, um,

[00:14:58] Benji Fisher: and, and then, um, [00:15:00] in contrib space, um, the sort of the, the most esoteric example I know of is a module called Commerce QuickBook WebConnect, which uses SOAP to, um, import data from QuickBooks into Drupal and to export data from Drupal to QuickBooks. And Lucas Hedding is one of the maintainers of that module and it's, it's, it's lightly used and I don't think there's a Drupal 10 compatible version yet.

[00:15:34] Benji Fisher: Um, but I used it on a recent project. And, and looked at it and it's, uh, it's very interesting in the way it uses migrate to export data from Drupal. And I, I think Mike, uh, suggested how this works earlier, but it, it goes through the, um, the orders one at a time and the, the load [00:16:00] plugin it, it uses, or, or the, the destination plugin it uses for the load stage, um, exports data about a commerce order into the Drupal State system.

[00:16:11] Benji Fisher: Um, and then, um, it, it cuts off the migration after processing one row and then another part of the module takes over and extracts the data from the state system and generates a SOAP response, which then gets batched somehow. So, um, so as Mike said, you, you, you could be exporting to a CSV file or something.

[00:16:39] Benji Fisher: In this case, we're exporting to the state system. And then other parts of the module use that to get the data into QuickBooks. Um, less esoteric than that. Um, the only other sort of general purpose, uh, destination plugin I know is in the [00:17:00] Migrate Plus module. There's, uh, an explicit, uh, destination plugin for a custom SQL table.

[00:17:08] Benji Fisher: So if you have custom database tables in your project that you need to migrate, um, you can use the, uh, SQL table plugin from, uh, Migrate Plus, and that doesn't use the entity system. It just writes directly to the SQL table.
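
Writing straight to a table, without going through an entity layer, can be sketched as follows. This is a Python illustration using the standard library's `sqlite3` module with an in-memory database; it is not the Migrate Plus plugin, and `TableDestination`, the `legacy_scores` table, and its columns are hypothetical.

```python
import sqlite3


class TableDestination:
    """Hypothetical destination that writes rows directly to a SQL table."""

    def __init__(self, connection, table, id_field, fields):
        # Table and column names come from trusted configuration here;
        # only the row values are passed as bound parameters.
        self.connection = connection
        self.table = table
        self.id_field = id_field
        self.fields = fields

    def import_row(self, row):
        columns = [self.id_field] + self.fields
        placeholders = ", ".join("?" for _ in columns)
        sql = (
            f"INSERT OR REPLACE INTO {self.table} "
            f"({', '.join(columns)}) VALUES ({placeholders})"
        )
        self.connection.execute(sql, [row[column] for column in columns])
        return row[self.id_field]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE legacy_scores "
    "(id INTEGER PRIMARY KEY, player TEXT, score INTEGER)"
)

destination = TableDestination(conn, "legacy_scores", "id", ["player", "score"])
for row in [
    {"id": 1, "player": "ada", "score": 90},
    {"id": 2, "player": "lin", "score": 70},
]:
    destination.import_row(row)

# Re-importing a row with the same ID updates it instead of duplicating it.
destination.import_row({"id": 1, "player": "ada", "score": 99})

stored = conn.execute(
    "SELECT id, player, score FROM legacy_scores ORDER BY id"
).fetchall()
```

Because nothing here depends on an entity API, the same pattern works for any custom table whose key column is known.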

[00:17:28] Janez Urevc: Um, I want to go back to the QuickBooks a little, a little bit, because I find this approach of migrating into the state system and then doing SOAP requests, um, very interesting. Do you, do you know why it was designed this way, or should we get Lucas on, on Team Talks to explain that to us?

[00:17:52] Benji Fisher: So I, I've never asked him about it.

[00:17:54] Benji Fisher: Um, and he, he wasn't the original author of the module, but, uh, but he, he did some work on it. [00:18:00] Um, but I'm, I'm pretty sure that the, uh, the reason they decided to use the migrate API for that is that it gives a way of tracking, um, the original entity ID and the exported ID. So Drupal has, um, as Mike mentioned, sequential IDs for each order.

[00:18:25] Benji Fisher: QuickBooks has its own way of keeping track of the orders. And, uh, the Migrate API provides a system for keeping track of which Drupal ID corresponds to which QuickBooks ID. And that's, uh, that's one, one of the, um, reasons for using, um, the Migrate API. If you do need to keep track of, uh, of old and new IDs,

[00:18:51] Benji Fisher: then that's one argument for using the Migrate API rather than just some sort of custom code for, for exporting data. [00:19:00]

[00:19:01] Janez Urevc: That's a great point. I didn't think about it. Um, so I remember when, uh, we were still, like the Drupal community was still developing Drupal 8. Um, it's been, you know, quite a long process and, um, also the discussion about including Migrate in core, um, happened at that time and then eventually the decision that we will use it to migrate from Drupal 7 to 8. Um, I remember that,

[00:19:37] Janez Urevc: um, back in those days, MongoDB, like the company behind the Mongo database, um, wanted to be like a first class citizen for Drupal, like providing out of the box support to run your, um, your Drupal site on Mongo instead of MySQL. [00:20:00] And I've been involved with Mongo quite a lot at that time, because I was working at Examiner and Examiner was using, uh, MongoDB for Drupal 7.

[00:20:10] Janez Urevc: But in Drupal 7, you still had to use, uh, a MySQL database next to it. So you had two databases, two sets of, uh, of backups and all that. Um, so they wanted to provide the, the ability for Mongo to be the sole database for Drupal 8 and, and on. And I remember that, uh, Chx, CHX was working on that. Um, and he was really excited about Migrate because he realized that if we would be using Migrate as a standard to migrate from D7 to D8,

[00:20:51] Janez Urevc: you would basically just swap the destination plugin and instead of loading into MySQL, you would load into MongoDB. [00:21:00] Um, but then, then the MongoDB company lost interest and, and, and stopped funding Chx to do that work. So that work was never completed. Um, it was like in a very early alpha stage. And the module is, is still on D.o. And I'm not sure what state it is in at the moment, but, um, A, it would have been very cool to have this possibility and, um, B, it again proves that, um, having the Load part of Migrate pluggable is very useful. Uh, do you two have any other, like, unusual or interesting cases related to the L part, uh, that you've seen in the past, or maybe any ideas how it could be used, but you've not seen it used that way yet?

[00:21:59] Benji Fisher: I [00:22:00] almost always create entities. I don't think I have any other examples of clever uses of the Load stage.

[00:22:08] Mike Ryan: It is a little exotic, you know, beyond, you know, if you want to export some CSVs for some reason.

[00:22:16] Janez Urevc: Yeah, it could be used for exporting, like similar to how WordPress exports posts as XML, you could use...

[00:22:25] Mike Ryan: Yeah. Although views export is probably easier for most of those cases.

[00:22:32] Janez Urevc: That's true.

[00:22:34] Janez Urevc: So I think that that's it for the L part. Um, this is also the end of the last, uh, episode in our ETL mini series. Uh, but we have some great team talks lined up. Uh, our goal is to put out one per week over the next [00:23:00] few months to support the community in the migration process from Drupal 7 to Drupal 10.

[00:23:05] Janez Urevc: Um, and as part of that, we're planning to talk about performance, which is something we care deeply about at Tag1. Um, and of course it applies to migrations as well, especially if you're handling really large data sets. Um, a full data migration can easily take over 12 hours or even several days. Um, and we'll do a handful of talks on this topic, including how to profile and tune a migration.

[00:23:34] Janez Urevc: We'll also do a talk on incremental migrations, where you can include or exclude things, uh, and run a migration on a subset of data to make it perform better. And every project owner wants their migration to be a success. We will dedicate an episode to discuss the most important factors for a successful Drupal 7 to 10 migration in order to help successfully navigate your migration [00:24:00] project.

[00:24:01] Janez Urevc: And other topics that we are planning to cover include porting custom code from Drupal 7 to Drupal 10, uh, the future of migrate tooling, how to port the theme and, uh, so much more. We, we hope that you'll tune in and enjoy our upcoming team talks. A huge thank you to the Tag1 Team. Thank you, Benji Fisher and Mike Ryan.

[00:24:29] Janez Urevc: Um, make sure that you check out the other segments in this series. There will be links to them in the show notes, along with all the other links that we mentioned today. If you like this talk, please remember to upvote, subscribe, and share it. Uh, you can check our past talks at tag1.com/ttt. That's three Ts for Tag1 Team Talks.

[00:24:55] Janez Urevc: As always, we'd love to hear your feedback and any topic [00:25:00] suggestions. You can write us at ttt@tag1.com. A big thank you to both of our guests and to everyone who tuned in. Thank you for joining us.
