Avoiding Data Loss: S3 Lifecycle Rules During Terraform Version Migrations
The Incident
Recently, we encountered a significant issue during a Terraform version upgrade from 0.12 to 0.13 that resulted in unexpected data loss in AWS S3. The incident occurred while migrating S3 bucket lifecycle rules, specifically because of the change in syntax for prefix filters. This post examines what happened, why it happened, and how to prevent similar issues.
Background
Our S3 bucket had three lifecycle rules:
- Archive files to Glacier after 30 days
- Expire [Directory-A] files after 1054 days
- Expire [Directory-B] files after 14 days
Original Configuration (Terraform 0.12)
lifecycle_rule:
  - id: "archive"
    enabled: true
    transition:
      - days: 30
        storage_class: "GLACIER"
  - id: "expire_objects_dir_a"
    enabled: true
    prefix: "[Directory-A]/"   # Old syntax
    expiration:
      days: 1054
  - id: "expire_objects_dir_b"
    enabled: true
    prefix: "[Directory-B]/"   # Old syntax
    expiration:
      days: 14
New Configuration (Terraform 0.13)
lifecycle_rule:
  - id: "archive"
    enabled: true
    transition:
      - days: 30
        storage_class: "GLACIER"
  - id: "expire_objects_dir_a"
    enabled: true
    filter:                    # New syntax
      prefix: "[Directory-A]/"
    expiration:
      days: 1054
  - id: "expire_objects_dir_b"
    enabled: true
    filter:                    # New syntax
      prefix: "[Directory-B]/"
    expiration:
      days: 14
What Went Wrong
During the migration, a critical issue occurred:
- The prefix filter syntax changed from direct notation to a filter block
- During the transition, the filter configuration was temporarily missing
- Without prefix filters, the rules applied globally to the entire bucket
- The shortest expiration rule (14 days) was applied to all files
- Result: All files older than 14 days were deleted, including objects already archived to Glacier (a quick check for unscoped rules is shown below)
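A quick way to catch this condition is to read each rule's scope back from the bucket itself. The command below is a minimal sketch using the AWS CLI; your-bucket-name is a placeholder, and the query covers both the newer Filter-based shape and the legacy top-level Prefix.
# Any rule that shows an empty prefix in both columns applies to every object in the bucket
aws s3api get-bucket-lifecycle-configuration --bucket your-bucket-name \
  --query 'Rules[].{ID:ID,Status:Status,FilterPrefix:Filter.Prefix,LegacyPrefix:Prefix}' \
  --output table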
Impact
- Loss of archived data in Glacier storage
- Unintended deletion of files outside the intended prefixes
- Potential business impact due to data loss
- Time and resources spent on incident investigation and recovery
Root Cause Analysis
The root cause was a combination of factors:
- Syntax changes in Terraform version upgrade
- Missing validation of lifecycle rule changes
- Insufficient review of terraform plan output
- No staged rollout of version upgrade
Prevention Measures
1. Pre-Migration Steps
- Document all existing lifecycle rules
- Take inventory of critical data
- Create a backup of critical configurations (a snapshot sketch follows this list)
- Test migration in a non-production environment
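A minimal sketch of that snapshot step, assuming the AWS CLI and Terraform are available and your-bucket-name is a placeholder:
# Snapshot the live lifecycle configuration straight from S3
aws s3api get-bucket-lifecycle-configuration --bucket your-bucket-name \
  > lifecycle-backup-$(date +%Y%m%d).json
# Keep a copy of the current Terraform state before starting the upgrade
terraform state pull > terraform-state-backup-$(date +%Y%m%d).json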
2. During Migration
- Use terraform plan with extra scrutiny on lifecycle rules
- Use the -target flag for a gradual rollout
- Implement changes in stages
- Verify each rule's scope after application (see the example commands below)
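A sketch of what that scrutiny and staged rollout can look like; the plan file names, the jq dependency, and the resource address aws_s3_bucket.example are illustrative assumptions rather than values from the incident:
# Save the plan, then show only the resources whose lifecycle rules would change
terraform plan -out=upgrade.tfplan
terraform show -json upgrade.tfplan \
  | jq '.resource_changes[]
        | select(.change.before.lifecycle_rule != .change.after.lifecycle_rule)
        | {address, before: .change.before.lifecycle_rule, after: .change.after.lifecycle_rule}'
# Target the bucket on its own so the change rolls out one resource at a time
terraform plan -target=aws_s3_bucket.example -out=bucket-only.tfplan
terraform apply bucket-only.tfplan
If the focused diff shows a rule losing its prefix, stop before applying anything.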
3. Post-Migration Validation
# Verify lifecycle rules
aws s3api get-bucket-lifecycle-configuration --bucket your-bucket-name
# See which objects actually fall under each prefix the rules are meant to cover
aws s3api list-objects-v2 --bucket your-bucket-name --prefix "[Directory-A]/" --query 'Contents[].Key'
4. Best Practices
- Implement version control for terraform configurations
- Use terraform workspaces for environment separation
- Create automated tests for infrastructure changes (a post-apply guard is sketched below)
- Maintain detailed documentation of S3 bucket policies
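One concrete form such a test can take is a post-apply guard that fails the pipeline when any lifecycle rule comes back without a prefix. This is a sketch under assumptions: jq is installed and your-test-bucket stands in for a non-production bucket.
#!/usr/bin/env bash
# Post-apply guard: every lifecycle rule must be scoped by a prefix
set -euo pipefail
unscoped=$(aws s3api get-bucket-lifecycle-configuration --bucket your-test-bucket \
  | jq -r '.Rules[] | select((.Filter.Prefix // .Prefix // "") == "") | .ID')
if [ -n "$unscoped" ]; then
  echo "Lifecycle rules with no prefix filter: $unscoped" >&2
  exit 1
fi
Run against the non-production copy of the bucket, this catches exactly the failure mode described above before the same change reaches production.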
Recovery Steps if an Incident Occurs
- Immediate Actions:
- Disable problematic lifecycle rules (an emergency stop is sketched after this list)
- Document affected files and timeframes
- Notify stakeholders
- Recovery Process:
- Restore from backups if available (see the versioning note after this list)
- Recreate data from source systems if possible
- Verify data integrity after restoration
- Post-Recovery:
- Update documentation
- Implement additional safeguards
- Conduct thorough review of similar configurations
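The first immediate action and the first recovery step can be sketched as follows. The bucket name is a placeholder, and deleting the lifecycle configuration stops every rule at once, so the intended rules have to be re-created afterwards.
# Emergency stop: removing the bucket's lifecycle configuration halts all further transitions and expirations
aws s3api delete-bucket-lifecycle --bucket your-bucket-name
# Confirm nothing remains (S3 returns NoSuchLifecycleConfiguration once the rules are gone)
aws s3api get-bucket-lifecycle-configuration --bucket your-bucket-name
For the restore itself, one avenue applies only if the bucket had versioning enabled, which this post does not state: lifecycle expiration on a versioned bucket leaves delete markers instead of erasing data, so listing those markers shows what can still be brought back.
# List the delete markers currently hiding objects under a prefix (versioned buckets only)
aws s3api list-object-versions --bucket your-bucket-name --prefix "[Directory-A]/" \
  --query 'DeleteMarkers[?IsLatest==`true`].[Key,VersionId]' --output text
# Deleting a marker with delete-object --version-id makes the previous version current again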
Conclusion
This incident highlights the importance of careful planning and validation during infrastructure version upgrades. The key takeaways are:
- Always review terraform plan output thoroughly
- Implement gradual rollouts for version upgrades
- Maintain comprehensive backup strategies
- Test migrations in non-production environments
- Document and verify all lifecycle rules before and after changes
Remember: S3 lifecycle rules are powerful tools but require careful management, especially during infrastructure changes. A systematic approach to changes and thorough validation can prevent unintended data loss.
–HTH–