Team introduction:
Build Reliability at Global Scale
Every time a short video is posted or viewed on TikTok, our team is working behind the scenes to make sure it happens instantly and reliably. The Short Video Reliability team blends deep systems expertise with large-scale architecture design to keep TikTok running smoothly for billions of users.
We design for the unexpected. Whether it’s a viral trend flooding the platform, a major global event, a cross-region migration, or disaster recovery, our systems are built to adapt and thrive.
We’re now looking for experienced engineers and architects to join our Singapore team. In this role, you’ll design, build, and scale the core reliability infrastructure that underpins TikTok’s short video ecosystem. Your work will directly shape the performance, resilience, and evolution of one of the most widely used platforms in the world.
Responsibilities:
- Architect and build self-healing systems that adapt to infrastructure changes, migrations, and global-scale challenges
- Design smart traffic and load management to keep performance steady during viral spikes, large events, and global campaigns
- Develop monitoring, alerting, and automation that spot and fix issues before they affect users
- Lead the creation of reliability frameworks for topology mapping, capacity planning, automated recovery, and disaster readiness
- Continuously refine system architecture for better performance, fault tolerance, and maintainability
- Apply chaos engineering, fault injection, and failure simulations to stress-test our systems
- Use A/B testing to measure the real-world impact of your improvements
- Mentor engineers and help set the team’s technical direction