Visier gives organizations a Workforce AI Edge: a set of AI-powered capabilities that help leaders understand the relationship between people and work, elevate the productivity of their employees, and win by adapting to change faster. The company is the global leader in AI-powered people analytics, workforce planning, and compensation management solutions. All Visier technology is underpinned by its Real-time People Data Platform, which uses AI to unlock the business-transforming potential of people data, work data, and the fusion of both.

Founded in 2010 by the pioneers of business intelligence, Visier has over 60,000 customers in 75 countries—including enterprises like BASF, Panasonic, Experian, Amgen, eBay, Ford Motor Company, and more.

Visier’s Shared Services SRE team is responsible for operating the cloud infrastructure underlying our technology platform and for working with the development teams to effectively use these technologies in production environments. We are also responsible for our AWS integration, API gateway, Cassandra, Kafka, Vault, and Consul implementations, data science workbench, and the network infrastructure and security that tie everything together.

Our job is to provide the infrastructure that lets our analytics platform and services scale.

What you'll be doing...

Managing LLM deployments on major cloud platforms such as AWS and Azure
Managing AI application observability platforms such as LangSmith
Deploying and maintaining highly available services in AWS using Terraform, CloudFormation, and Jenkins
Debugging production issues at any level, from the hardware layer and the OS kernel all the way up to working hand in hand with developers to improve application behaviour
Working with the Kong API gateway to provide secure and reliable API access to our customers and partners
Writing secure code to safeguard Visier and our customers' data, including developing our application security infrastructure
Optimizing our diagnostics infrastructure components like Splunk, CloudWatch, and Prometheus
Supporting large clusters of third-party systems like Cassandra, Postgres, and Kafka
Preparing for and simulating disasters of all sorts; we’re mission-critical for our customers and need to stay up, no matter what
Working closely with other development teams to design the infrastructure that supports application features

What you'll bring to the table...

Extensive experience in enterprise-level scalability of services
Strong coding skills in Java, Scala, Python, or Groovy
Deep expertise in networking, network security, firewalls, routing, DNS, and advanced Linux systems and security
Hands-on proficiency with AWS services including EC2, S3, RDS, IAM, Lambda, and VPC
Strong experience with containerization technologies (Kubernetes, ECS) and managing them via Infrastructure as Code
Skilled in Infrastructure as Code (IaC) practices, with deep experience developing and maintaining Terraform code and modules
Strong knowledge of deployment and configuration management tools
Proven ability to perform deep troubleshooting and root cause analysis to resolve complex system issues
Experience with system security patching to maintain infrastructure integrity and resilience

Most importantly, you share our values...

You roll up your sleeves
You make it easy
You are proud
You never stop learning
You play to win

Staff Site Reliability Engineer
