SRE Engineer | LMA Recruitment Asia

Role Responsibilities

Act as a bridge between Development & Operations
Responsible for faster recovery from production incidents, root cause identification, and defect fixing
Responsible for identifying and delivering improvement opportunities, such as manual task automation, performance/throughput improvements, and batch optimisation, that will ultimately improve production service stability
Responsible for identifying and implementing solutions that improve overall Operations efficiency and result in cost savings or cost avoidance

Strategy
Resiliency

Lead or be part of the SRE team to enhance the application and infrastructure resiliency of the service through self-healing and automated failovers, targeting 99.99% uptime for customers.
Oversee the planned random disruption of production infrastructure to ensure accountability for building resilient, always-on systems.
Build resilience into the application so underlying system failures are handled gracefully and do not impact end users. Influence design/development teams to always be thinking of the rainy-day scenarios.

Efficiency

Identify opportunities to eliminate all manual and repeatable activities (toil) via tooling and automation
Reduce the number of repeat incidents by permanently fixing the underlying root cause of issues

Capacity Planning

Develop automated predictive analysis of future capacity needs and drive the proactive upgrade of service capacity well in advance
Using the company's SDI (Software Defined Infrastructure), develop auto-scaling that keeps critical services resilient to fluctuations in demand
Continuously monitor service demand / capacity for any discrepancies or spikes

Business
Availability/Reliability

Take responsibility for meeting SLA/XLA expectations around the operability and reliability of our critical user service journeys, where our customers expect a 24x7x365 digital service offering. Examples of "always on" techniques to be used include caching, circuit breakers, dark and canary releases, store and service patterns and alternate user experience flows.
Lead, own, manage, monitor and optimize the reliability and health of all environments
Design, code and implement break fixes to improve service availability based on outcomes from thematic reviews

Latency & Performance

Drive conversations with Product Teams using SLI/SLO data to ensure the balance between development velocity and service reliability is optimized
Iteratively review SLI/SLO/Error Budget policy to ensure the quantitative indicators of customer experience are accurate
Where an increased focus on reliability is required, influence senior stakeholders to ensure resourcing and effort are made available

Processes
Transition to Production

Champion and evolve continuous delivery best practice standards to reduce release related incidents
Partner with development teams to ensure applications are designed with scale, resilience, and performance in mind.

Monitoring

Optimize monitoring to reduce false positive alerts
Creatively deepen monitoring capabilities by leveraging the three pillars of observability: logs, metrics and traces
Ensure all critical user service journeys are traceable end to end
Ensure production solutions are fit for purpose. Where gaps are identified, put a plan in place to uplift the toolset

People and Talent

Establish and manage the SRE team when applicable
Drive an efficient target operating model and enhance the existing capabilities of the team.
Lead through example and build the appropriate culture and values.
Ensure the provision of ongoing training and development for people, and ensure that holders of all critical functions are suitably skilled and qualified for their roles, with effective supervision in place to mitigate any risks.
Set and monitor job descriptions and objectives for direct reports and provide feedback and rewards in line with their performance against those responsibilities and objectives

Risk Management

Identify key issues in the business areas being supported, and based on this information, put in place appropriate controls and measures to assess, monitor, control & mitigate risks.
Ensure a full understanding of the risk and control environment within Technology Services.
Ensure support procedures are in place and adhere to Group Security & Audit policies within Technology Services.
Actively engage with all audit issues arising in this support environment.

Governance

Responsible for assessing the effectiveness of the Group's arrangements to deliver effective governance, oversight and controls in the business and, if necessary, oversee changes in these areas
Maintain awareness and understanding of the regulatory framework in which the Group operates, and of the regulatory requirements and expectations relevant to the role.
Responsible for delivering 'effective governance', with the capability to challenge fellow executives effectively and a willingness to work with any local regulators in an open and cooperative manner.

Regulatory & Business Conduct

Display exemplary conduct and live by the Group's Values and Code of Conduct.
Take personal responsibility for embedding the highest standards of ethics, including regulatory and business conduct. This includes understanding and ensuring compliance with, in letter and spirit, all applicable laws, regulations, guidelines and the Group Code of Conduct.
Lead Technology Services SRE to achieve the outcomes set out in the Bank's Conduct Principles: Fair Outcomes for Clients; Effective Financial Markets; Financial Crime Compliance; The Right Environment.
Effectively and collaboratively identify, escalate, mitigate and resolve risk, conduct and compliance matters.

Key Stakeholders

Business Heads in the country and the group
Domain Heads in Tech Services
Country CIO and CTM
Business CIO
Development Head
Product Owners

Our Ideal Candidate
6-9 years of overall IT experience in Development/Production Support/DevOps/SRE, including at least 2-4 years of experience with the above tools and cloud platform.

Interested applicants, please send your CV in Microsoft Word format to Nishana Rahim. Email: nishana.rahim@lmarecruitment.asia

Company Reg No.: 201131609D, Licence No.: 11C4684