In this blog post, I will show you how to debug and troubleshoot issues with your AWS DMS Tasks during the migration of large tables from On-Premise to Cloud.
As part of our Cloud Migration strategy, we had to migrate data from On-Premise SQL Server tables to AWS Aurora MySQL tables. One of the tables we wanted to migrate had about 700 million rows, but the AWS DMS task kept restarting by itself after 4-5 hours, when it was 60-70% complete. The other, smaller tables migrated successfully without issues. So the problem was specific to tables with more than 500-600 million rows, where DMS kept restarting by itself during the data migration.
The first step in troubleshooting this error was to enable error logs in AWS DMS. Since the DMS task ran for a couple of hours, it was clear that there were no issues with the DMS task start event. It seemed more likely that DMS was unable to move some specific data from the source table to the destination.
2019-05-25T19:24:58 [TASK_MANAGER ]I: Task running full load and CDC in resume mode after recoverable error, retry #3 (replicationtask.c:1239)
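When a task log runs to thousands of lines, it can help to pull out just the retry events. The snippet below is a minimal sketch that assumes every line looks like the sample above (timestamp, bracketed component, one-letter level, message, optional source location). The pattern is inferred from that single line, and the function names are hypothetical.

```python
import re

# Pattern inferred from the one sample line in this post; real DMS logs
# may contain lines in other shapes, which this simply skips.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"\[(?P<component>[A-Z_]+)\s*\]"
    r"(?P<level>[A-Z]):\s+"
    r"(?P<message>.*?)"
    r"(?:\s+\((?P<source>[^()]+:\d+)\))?$"
)
RETRY = re.compile(r"retry #(\d+)")


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def retry_numbers(lines):
    """Collect the N from every 'retry #N' message, in log order."""
    found = []
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        hit = RETRY.search(record["message"])
        if hit:
            found.append(int(hit.group(1)))
    return found


sample = (
    "2019-05-25T19:24:58 [TASK_MANAGER ]I: Task running full load and CDC "
    "in resume mode after recoverable error, retry #3 (replicationtask.c:1239)"
)
print(parse_line(sample)["component"])  # TASK_MANAGER
print(retry_numbers([sample]))          # [3]
```

A rising retry number across a log is a quick sign that the task is looping on the same recoverable error rather than failing once.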
For troubleshooting any issue with AWS DMS, it is necessary to have logs enabled. The DMS logs typically give a better picture and help find errors or warnings that indicate the root cause of the failure. If the logs are not available, there is not much you can do in terms of detailed troubleshooting. So the next step is to turn on DMS logs, kick off the job again, and check whether the errors are captured in the logs.
If logs are not enabled, you need to set up a new task with logging enabled so that if and when it errors out, you can take a look and troubleshoot it.
Please find below the list of all logger levels in a migration phase:
- LOGGER_SEVERITY_DEFAULT – Default logging
- LOGGER_SEVERITY_ERROR – Log only errors for the appropriate phase
- LOGGER_SEVERITY_WARNING – Log warnings and errors
- LOGGER_SEVERITY_INFO – Log informational messages, warnings, and errors
- LOGGER_SEVERITY_DEBUG – Log in debug mode (more verbose than default)
- LOGGER_SEVERITY_DETAILED_DEBUG – Most verbose logging
The default log level severity is ‘LOGGER_SEVERITY_DEFAULT’.
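As a rough illustration of how these levels are used, here is a minimal sketch that builds the "Logging" section of the DMS task settings JSON and, optionally, pushes it to a task with boto3. The function names and the task ARN are hypothetical. It assumes boto3 is installed with working credentials. As far as I know the task has to be stopped before it can be modified, and it is worth checking in the DMS documentation how a partial settings document is handled.

```python
import json

# Severity names as listed in this post.
SEVERITIES = (
    "LOGGER_SEVERITY_DEFAULT",
    "LOGGER_SEVERITY_ERROR",
    "LOGGER_SEVERITY_WARNING",
    "LOGGER_SEVERITY_INFO",
    "LOGGER_SEVERITY_DEBUG",
    "LOGGER_SEVERITY_DETAILED_DEBUG",
)


def build_logging_settings(components, severity="LOGGER_SEVERITY_DEFAULT"):
    """Return a task-settings dict that enables logging for the given components."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "Logging": {
            "EnableLogging": True,
            "LogComponents": [
                {"Id": component, "Severity": severity} for component in components
            ],
        }
    }


def apply_logging_settings(task_arn, settings):
    """Send the settings to an existing task (hypothetical helper, calls AWS)."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("dms")
    return client.modify_replication_task(
        ReplicationTaskArn=task_arn,
        ReplicationTaskSettings=json.dumps(settings),
    )


settings = build_logging_settings(["SOURCE_UNLOAD", "TARGET_LOAD"])
print(json.dumps(settings, indent=2))
```

Passing a different severity to the same builder, for example LOGGER_SEVERITY_DETAILED_DEBUG, raises the verbosity for just the components you name.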
When a task is restarted, AWS DMS does not reload tables that already completed the initial load, but it reloads from the beginning any table whose initial load did not complete. For a very large table, every restart therefore throws away hours of progress.
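To see which tables would start over on the next restart, you can look at the per-table statistics of the task. The sketch below is only an illustration: the helper names, the sample rows, and the task ARN are hypothetical. The state string "Table completed" is my assumption of what DMS reports for a finished full load, so verify it against your own output.

```python
# Assumed value of TableState for a table whose full load has finished.
COMPLETED_STATE = "Table completed"


def tables_reloaded_on_restart(table_stats):
    """Return (schema, table) pairs whose initial load has not completed."""
    return [
        (row["SchemaName"], row["TableName"])
        for row in table_stats
        if row.get("TableState") != COMPLETED_STATE
    ]


def fetch_table_stats(task_arn):
    """Page through describe_table_statistics for one task (calls AWS)."""
    import boto3  # imported here so the pure function above works without boto3

    client = boto3.client("dms")
    stats, marker = [], None
    while True:
        kwargs = {"ReplicationTaskArn": task_arn}
        if marker:
            kwargs["Marker"] = marker
        page = client.describe_table_statistics(**kwargs)
        stats.extend(page["TableStatistics"])
        marker = page.get("Marker")
        if not marker:
            return stats


# Hypothetical sample rows, shaped like the TableStatistics entries.
sample_stats = [
    {"SchemaName": "dbo", "TableName": "small_table", "TableState": "Table completed"},
    {"SchemaName": "dbo", "TableName": "large_table", "TableState": "Full load"},
]
print(tables_reloaded_on_restart(sample_stats))  # [('dbo', 'large_table')]
```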
Here’s the troubleshooting document for DMS for reference: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Troubleshooting.html
We reviewed the DMS logs but were not able to find anything concrete about the root cause. We wanted some additional information to troubleshoot this error and hence reached out to AWS Support to investigate why the DMS task errored out.
The AWS DMS team got back to us with some suggestions before another migration run. They recommended setting up a new task to migrate only the table that was failing, after making the two changes below:
1. Add the below Extra Connection Attributes to the source endpoint:
cdcTimeout=1200 (by default the value is set to 600 seconds)
ignoreTxnCtxValidityCheck=false
Significance of ignoreTxnCtxValidityCheck=false: this internal property tells DMS to keep replicating and not raise an error on such events, which can avoid the task failure.
2. Enable detailed logging for the new task to capture more detailed logs for this specific investigation. Enable detailed debugging for SOURCE_CAPTURE, TARGET_APPLY, SOURCE_UNLOAD, and TARGET_LOAD.
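For the first change, extra connection attributes are passed to the endpoint as a single string of name=value pairs separated by semicolons. Here is a minimal sketch of building that string and applying it with boto3. The helper names and the endpoint ARN are hypothetical, and the attribute values are simply the ones AWS Support suggested to us.

```python
def build_extra_connection_attributes(attributes):
    """Join name/value pairs into the semicolon-separated string DMS expects."""
    return ";".join(f"{name}={value}" for name, value in attributes.items())


def apply_to_endpoint(endpoint_arn, extra_connection_attributes):
    """Update an existing endpoint with the attribute string (calls AWS)."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("dms")
    return client.modify_endpoint(
        EndpointArn=endpoint_arn,
        ExtraConnectionAttributes=extra_connection_attributes,
    )


# Values taken from the recommendation described in this post.
attributes = {
    "cdcTimeout": 1200,
    "ignoreTxnCtxValidityCheck": "false",
}
eca = build_extra_connection_attributes(attributes)
print(eca)  # cdcTimeout=1200;ignoreTxnCtxValidityCheck=false
```

The same string can also be pasted into the extra connection attributes field of the source endpoint in the console, which is what we did.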
Detailed logging is something you can try if the errors aren’t visible at the first level of logging. It’s not something you should always have on, only in scenarios where you need more detailed logs to understand what’s happening in the migration when issues come up.
So whenever you set up new tasks, these configuration changes are not needed initially. You can stick with the original configuration. If you run into errors for which you need more data, you can enable detailed logging to debug the issue.
Detailed logging consumes more storage on the replication instance, so leaving it enabled all the time is not recommended. It is typically used when the default logging doesn’t have enough information and we need more detail for a deep dive.
After adding the above configuration settings, the DMS task for the large table completed successfully and we were able to migrate 620 million rows of data from On-Premise to AWS.
Thanks for sharing this. We are also facing the same issue. But can you confirm it is IgnoreTxnCtxValidityCheck=false and not IgnoreTxnCtxValidityCheck=true? Based on your description of this flag, shouldn’t the flag be ‘true’ if we want DMS to ignore these errors?
Can you shed some more light on the below 2 parameters? It’s not very clear how changing these 2 parameters stopped the task from restarting.
cdcTimeout=1200 (by default the value is set to 600 seconds) -> what do the seconds mean here?
ignoreTxnCtxValidityCheck=false -> what does this actually do? As per what is written, it ignores the error and keeps proceeding without allowing the job to fail. Isn’t this a problem in other cases where the error might be of concern? How do we distinguish in that case?
I have created a DMS task to migrate data from an Oracle database on an EC2 instance in another region, with PostgreSQL as the target. The task completed without any errors, but we haven’t seen any data in the target database. We used SCT for schema conversion.
Can someone help me with this? The DMS task is successful, but no data is migrated from source to target.
Source: EC2 with Oracle DB in another region
Target: PostgreSQL
Connected through VPC peering, SCT for schema conversion.
Thanks for posting your experience. One property that helped my migration of large tables was reducing the ‘commit rate’ from the default 10,000 rows to 3,000 rows. Although the task took a little longer, it reduced memory overhead and avoided errors. Another successful approach we used with tables that had numerous LOBs was to break the replication into multiple tasks based on an ID range.
Thanks for writing this