In this blog post, I will show you how to debug and troubleshoot issues with your AWS DMS Tasks during the migration of large tables from On-Premise to Cloud.
As part of our Cloud Migration strategy, we had to migrate data from On-Premise SQL Server tables to AWS Aurora MySQL tables. One of the tables had about 700 million rows. The AWS DMS task for that table kept restarting by itself after 4-5 hours, at the point where it was 60-70% complete. The smaller tables migrated successfully without issues, so the problem was specific to tables with more than 500-600 million rows.
The first step in troubleshooting was to enable logging in AWS DMS. Since the DMS task ran for a couple of hours before restarting, it was clear that the task start itself was not the problem. That suggested DMS was unable to move some specific data from the source table to the destination. The task log contained entries such as this one:
2019-05-25T19:24:58 [TASK_MANAGER ]I: Task running full load and CDC in resume mode after recoverable error, retry #3 (replicationtask.c:1239)
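When a task log runs to many thousands of lines, it can help to pull out retry entries like the one above programmatically. The following is a minimal Python sketch and was not part of the original investigation. The regular expression is an assumption based only on the single log line shown here, and `parse_retry` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed pattern, derived only from the sample log line in this post:
# "<timestamp> [<COMPONENT> ]<LEVEL>: ... recoverable error, retry #<n> ..."
RETRY_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+\[(?P<component>[A-Z_]+)\s*\]"
    r"(?P<level>[A-Z]):\s+.*recoverable error, retry #(?P<retry>\d+)"
)


def parse_retry(line: str) -> Optional[dict]:
    """Return timestamp, component and retry number, or None if no match."""
    match = RETRY_PATTERN.search(line)
    if match is None:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "component": match.group("component"),
        "retry": int(match.group("retry")),
    }


sample = (
    "2019-05-25T19:24:58 [TASK_MANAGER ]I: Task running full load and CDC "
    "in resume mode after recoverable error, retry #3 (replicationtask.c:1239)"
)
print(parse_retry(sample))
```

Running the helper over every line of a downloaded log file and counting the non-`None` results gives a quick view of how often the task resumed after a recoverable error.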
To troubleshoot any issue with AWS DMS, you need to have logs enabled. The DMS logs typically give a clearer picture and help you find the errors or warnings that point to the root cause of the failure. If logs are not available, there is not much you can do in terms of detailed analysis. So the next step is to turn on DMS logging, run the job again, and check whether the errors are captured in the logs.
If logging is not enabled, set up a new task with logging enabled so that if and when it errors out, you can review the logs and troubleshoot.
Please find below the list of all logger levels in a migration phase:
- LOGGER_SEVERITY_DEFAULT – Default logging
- LOGGER_SEVERITY_ERROR – Log only errors for the appropriate phase
- LOGGER_SEVERITY_WARNING – Log warnings and errors
- LOGGER_SEVERITY_INFO – Log informational messages, warnings, and errors
- LOGGER_SEVERITY_DEBUG – Log in debug mode (more verbose than default)
- LOGGER_SEVERITY_DETAILED_DEBUG – Most verbose logging
The default log severity level is LOGGER_SEVERITY_DEFAULT.
If a task restarts before a table has finished its initial load, AWS DMS reloads that table from the beginning. Tables that already completed the initial load are not reloaded. For a very large table, this means every restart throws away hours of full-load progress.
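The restart rule can be illustrated with a tiny Python sketch. This is only a model of the behavior described above, not a call into any AWS API; the table names and the `tables_to_reload` helper are hypothetical.

```python
def tables_to_reload(load_completed: dict) -> list:
    """Return the tables that start over from the beginning after a restart.

    load_completed maps a table name to True if its initial (full) load
    finished before the restart, and False otherwise.
    """
    return [table for table, done in load_completed.items() if not done]


# Hypothetical state at the moment the task restarts.
state = {
    "small_table_a": True,   # finished full load, will not be reloaded
    "small_table_b": True,   # finished full load, will not be reloaded
    "large_table": False,    # still loading, so it starts again from row one
}
print(tables_to_reload(state))  # ['large_table']
```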
Here’s the troubleshooting document for DMS for reference: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html
We reviewed the DMS logs but could not find anything concrete about the root cause. We needed more information to troubleshoot the error, so we reached out to AWS Support to investigate why the DMS task errored out.
The AWS DMS team got back to us with some suggestions to apply before another migration run. They recommended setting up a new task that migrates only the failing table, with the following two changes:
1. Add the following Extra Connection Attributes to the source endpoint:
cdcTimeout=1200 (the default value is 600 seconds)
ignoreTxnCtxValidityCheck=false (this internal property tells DMS to keep replicating instead of raising an error on such events, which can prevent the task from failing)
2. Enable detailed logging for the new task so that it captures more detail for this specific investigation. Enable detailed debugging for these components: SOURCE_CAPTURE, TARGET_APPLY, SOURCE_UNLOAD, TARGET_LOAD
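As a rough illustration of the two changes, the Python sketch below builds the extra connection attributes string and a logging section for the task settings JSON. It assumes that extra connection attributes are written as semicolon-separated `name=value` pairs and that the task settings use a `Logging` object with `EnableLogging` and `LogComponents` entries. The helper names are hypothetical, and the output should be checked against the current DMS documentation before it is applied to a real endpoint or task.

```python
import json

# Components that AWS Support asked us to switch to detailed debugging.
DETAILED_COMPONENTS = [
    "SOURCE_CAPTURE",
    "TARGET_APPLY",
    "SOURCE_UNLOAD",
    "TARGET_LOAD",
]


def build_extra_connection_attributes(attributes: dict) -> str:
    """Join name=value pairs with semicolons (assumed DMS format)."""
    return ";".join(f"{name}={value}" for name, value in attributes.items())


def build_logging_settings(components: list, severity: str) -> dict:
    """Build the Logging section of the task settings (assumed layout)."""
    return {
        "Logging": {
            "EnableLogging": True,
            "LogComponents": [
                {"Id": component, "Severity": severity}
                for component in components
            ],
        }
    }


# Change 1: source endpoint attributes from the list above.
eca = build_extra_connection_attributes(
    {"cdcTimeout": 1200, "ignoreTxnCtxValidityCheck": "false"}
)
print(eca)  # cdcTimeout=1200;ignoreTxnCtxValidityCheck=false

# Change 2: detailed debugging for the four components.
settings = build_logging_settings(
    DETAILED_COMPONENTS, "LOGGER_SEVERITY_DETAILED_DEBUG"
)
print(json.dumps(settings, indent=2))
```

The string can be pasted into the Extra connection attributes field of the source endpoint, and the JSON can be merged into the task settings when the new task is created.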
Detailed logging is worth trying when the errors are not visible at the default logging level. It is not something to leave on all the time, because it consumes more storage on the replication instance.
So whenever you set up new tasks, these configuration changes are not needed initially. You can stick with the original configuration, and enable detailed logging only if you run into errors for which the default logs do not have enough information for a deep dive.
After adding the above configuration settings, the DMS task for the large table completed successfully and we were able to migrate 620 million rows of data from On-Premise to AWS.