This position is for an SRE Problem and Knowledge Management Team Lead within the enabling group, Site Reliability Engineering and Governance (SRE & Governance) department.
This role is expected to strategically lead incident retrospective and problem management operations, as well as broader SRE activities related to maintenance management, including availability, performance, change management, monitoring, capacity planning & the solutions derived from emergency response.
The Team Lead ensures that retrospective activities are orchestrated & carried out effectively while promoting a blameless culture in accordance with SRE principles.
Responsibilities:
- Mentor the team in the seamless facilitation & conduct of root cause analysis (RCA) activities from end to end
- Lead the facilitation for high-severity incidents, liaising with top/senior management & keeping them updated
- Act as the primary focal point for presenting at the RCA Forum, Tech Risk Forum and other senior management meetings to report updates on retrospective findings & action plans
- Absorb new technology rapidly & apply effectively
- Communicate well with technical & non-technical colleagues
- Work to a high standard within agreed timescales
- Undertake any other tasks or duties that are reasonable & requested by the supervisor or a member of the senior management team.
- Manage resources to ensure problem management activities are carried out effectively and efficiently
- Provide available platforms and channels to ensure stakeholders are kept updated on results of retrospectives and RCA activities
- Demonstrate authority in problem management calls
- Serve as point of contact for assigned higher-severity incidents (from incident retrospective calls all the way up to Management Report (MR) documentation and publishing)
- Take accountability for initiatives on the enhancement activities related to SRE as a result of retrospectives
- Collaborate with Engineering Teams within SRE and with lines of business (LOBs) on enabling activities as part of the preventive measures
- Minimum 15 years of process improvement/root cause analysis (RCA) exposure & involvement in leading discussions as a problem manager or incident commander, preferably in the Technology & Operations space
- Experience with Jira, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3, Cloud Computing.
- Good exposure to logging & monitoring tools like Dynatrace, Prometheus, Grafana, ELG/ELK
- In-depth understanding of Incident & Problem Management functions & activities (i.e. hardware- & software-related incident & problem management)
- Work with stakeholders & command centre in troubleshooting, escalating & resolving critical site incidents.
- Identify recurring system/application issues & work with cloud team, infra teams, product development, vendors & other stakeholders in investigating & resolving the root cause
- Maintain accurate documentation of incidents, including impact details, timelines & steps taken for mitigation/resolution.
- Strong verbal & written communication skills, particularly effective documentation skills
- Minimum 10 years of software development, technical support or operations experience.
- Basic knowledge of Linux, AIX, Solaris and Windows
- Exposure to enterprise databases, e.g. Oracle, SQL Server, MariaDB, MongoDB & Sybase.
- Knowledge of system, multi-tier application & network troubleshooting
- Essential knowledge & awareness of Public/Private/Hybrid cloud solutions.