KLAARA specialises in AI-driven solutions tailored to the needs of customers in Capital Markets, Insurance, Reinsurance, and other sectors. Our cutting-edge platform empowers businesses to transform their unstructured data into actionable intelligence, fueling better decision-making and driving competitive advantage.
You will collaborate closely with cross-functional teams, including Machine Learning Engineering, Security, Software Engineering, Business Analysts, and Customers, to ensure seamless delivery, performance testing, and documentation of our solutions. This is a hands-on role requiring strong technical acumen, cross-functional teamwork, and occasional code fixes.
You are expected to:
- Design, implement, and manage scalable cloud infrastructure (AWS, Azure, or GCP) tailored for AI/ML workloads.
- Lead and execute the implementation of software deployment packages
at customer sites, ensuring robust instrumentation and scalable
performance.
- Design, develop, and execute load and scalability tests to validate
system reliability under various conditions.
- Implement and maintain monitoring and alerting systems to
proactively identify and resolve issues in development, staging, and
production environments.
- Participate in incident response, root cause analysis, and
contribute to continuous improvement of system reliability.
- Develop and maintain automation scripts and tools to streamline
deployment, infrastructure provisioning, and routine operational tasks.
- Create and maintain clear, comprehensive documentation for
customers on deployment, configuration, and operations.
- Collaborate with internal development teams, business analysts, and
customers to align technical solutions with business requirements.
- Troubleshoot and resolve deployment and integration issues,
occasionally contributing fixes in TypeScript or Rust.
- Ensure compliance with security and regulatory requirements,
particularly in Banking environments.
- Maintain and optimize CI/CD and MLOps pipelines and deployment automation
using Jenkins and Atlassian tools.
- Manage and maintain development, staging, and production
environments, ensuring consistency and high availability.
- Support and manage infrastructure components using Linux, Docker,
and Kubernetes.
- Integrate and manage application components such as Kong,
OpenTelemetry, Kafka, RabbitMQ, and MySQL.
- Work with systems designed using microservice architecture,
ensuring seamless deployment and integration across services.
- Utilize Infrastructure as Code (IaC) tools such as Terraform,
Ansible, or Helm to provision and manage infrastructure effectively.
Requirements
- Ph.D. or Master's degree in Information Technology, Computer Science, Engineering,
or a related field.
- Minimum of 10 years of experience in Cloud Operations, Infrastructure Engineering, and Automation, with at least 5 years in AWS or similar cloud operations (Azure, GCP), preferably within the Banking or Financial Services industry.
- Proven experience with CI/CD tools such as Jenkins and the
Atlassian suite (Bitbucket, Jira, Confluence).
- Solid understanding of mainstream cloud products and services including Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), and OpenShift.
- Hands-on experience with Linux, Docker, Kubernetes, and
cloud-native deployment practices.
- Solid understanding of microservice architecture and experience
deploying and managing distributed systems.
- Experience with monitoring and observability tools, especially
OpenTelemetry, and managing alerts and incident response processes.
- Proficiency in automation and scripting (e.g., Bash, Python, or
similar) for deployment and infrastructure tasks.
- Experience building and managing infrastructure with Infrastructure as Code (IaC) tools such as
Terraform or OpenTofu.
- Experience with API- and microservices-based architecture patterns for deploying ML models in the cloud.
- Familiarity with API gateways (e.g., Kong), message brokers (Kafka,
RabbitMQ), and databases (MySQL, PostgreSQL).
- Ability to design and execute performance, load, and scalability
tests.
- Basic proficiency in TypeScript and/or Rust, sufficient for reading
code and implementing minor fixes.
- Strong understanding of security, compliance, and risk management
practices in financial services or banking.
- Excellent communication and collaboration skills; able to interact
with technical and non-technical stakeholders.