Job Description:
- SLOs & error budgets - Define, track, and evangelize latency and availability targets for our payment APIs.
- Observability - Deploy Cloud Monitoring, Cloud Trace, Error Reporting, and dashboards; integrate alerts via Incident.io and Slack for on-call.
- Incident lifecycle - Establish blameless postmortems, guardrails, and runbooks to drive learning and prevent recurrence.
- CI/CD golden path - Codify Cloud Build pipelines and automated canary rollouts for Cloud Functions / Cloud Run.
- Infrastructure as Code - Manage GCP resources; embed security, IAM least-privilege, and cost controls by default.
- Performance & cost tuning - Profile hot paths (BigQuery, Firestore, Pub/Sub) and implement caching or concurrency improvements to keep user-facing latency < 100 ms.
- Developer tooling - Eliminate toil by improving local-to-prod parity and secrets management, and by making environments spin up with a single command.
- Culture carrier - Instill reliability thinking across engineering and product as the first platform-focused hire.
Requirements:
- 5+ years of experience building and operating production systems at scale, ideally on Google Cloud or a similar serverless stack and in fast-paced or startup settings.
- Hands‑on fluency with Firebase, Cloud Build, Cloud Run/Functions, Pub/Sub, Cloud SQL/Spanner, and VPC Service Controls.
- Strong coding skills in Python or Go for automation, with an eye for maintainability.
- Demonstrated track record of driving observability, on‑call, and cost optimisation in a fast‑moving environment.
- Excellent collaboration and communication skills to work effectively with cross-functional teams.
- Experience in payments, PCI‑DSS, or crypto settlement flows is a bonus.
Tech note: we are 99% serverless. There are no pet VMs to patch, but the stakes are higher: every cold start, DB connection pool, and retry policy can impact real money transfers. You’ll architect for resiliency and velocity.