The Role:
This position is for an SRE Problem and Knowledge Management Team Lead within the enabling group, the Site Reliability Engineering and Governance (SRE & Governance) department.
The role strategically leads incident retrospective and problem management operations, and supports other SRE activities related to maintenance management, including availability, performance, change management, monitoring, capacity planning & the solutions derived from emergency response.
The Team Lead ensures that retrospective activities are orchestrated & carried out effectively while promoting a blameless culture in accordance with SRE principles.

Responsibilities:

Mentor the team in the seamless facilitation & conduct of root cause analysis (RCA) activities from end to end
Lead the facilitation for high-severity incidents, liaising with top/senior management and keeping them updated
Act as the prime focal point for presenting in the RCA Forum, Tech Risk Forum and other senior management meetings to report updates on retrospective findings & action plans
Absorb new technology rapidly & apply effectively
Communicate well with technical & non-technical colleagues
Work to a high standard with agreed timescales
Undertake any other tasks or duties that are reasonable & requested by the supervisor or a member of the senior management team.
Manage resources to ensure problem management activities are carried out effectively and efficiently
Provide platforms and channels that keep stakeholders updated on the results of retrospectives and RCA activities
Demonstrate authority in problem management calls
Act as point of contact for assigned incidents of higher severity (from incident retrospective calls all the way up to Management Report (MR) documentation and publishing)
Take accountability for initiatives on the enhancement activities related to SRE as a result of retrospectives
Collaborate with Engineering Teams within SRE and with LOBs on enabling activities as part of preventive measures

Requirements:

Minimum 15 years of process improvement/root cause analysis (RCA) exposure & involvement leading discussions as a problem manager or incident commander, preferably in the Technology & Operations space
Experience with JIRA, Confluence, Jenkins, Nexus, SonarQube, Bitbucket, S3 and cloud computing.
Good exposure to logging & monitoring tools such as Dynatrace, Prometheus, Grafana, ELG/ELK
In-depth understanding of Incident & Problem Management functions & activities (i.e. hardware- & software-related incident & problem management)
Work with stakeholders & command centre in troubleshooting, escalating & resolving critical site incidents.
Identify recurring system/application issues & work with cloud team, infra teams, product development, vendors & other stakeholders in investigating & resolving the root cause
Maintain accurate documentation of incidents, including impact details, timelines and steps taken for mitigation/resolution.
Strong verbal & written communication skills, particularly effective documentation skills
Minimum 10 years of software development, technical support or operations experience.
Basic knowledge of Linux, AIX, Solaris and Windows
Exposure to enterprise databases, e.g. Oracle, SQL Server, MariaDB, MongoDB & Sybase.
Knowledge of systems, multi-tier application & network troubleshooting
Essential knowledge & awareness of Public/Private/Hybrid cloud solutions.
