Lead Site Reliability Engineer - High Performance Compute
Company: JPMorgan Chase & Co.
Location: Jersey City
Posted on: April 1, 2026
Job Description:
As a Lead Site Reliability Engineer at JPMorgan
Chase within the Markets Engineering & Architecture team, you will
solve complex and broad business problems with simple and
straightforward solutions. Through code and cloud infrastructure,
you will configure, maintain, monitor, and optimize applications
and their associated infrastructure to independently decompose and
iteratively improve on existing solutions. You are a significant
contributor to your team by sharing your knowledge of end-to-end
operations, availability, reliability, and scalability of your
application or platform.

Job responsibilities:

Guide and assist others in building appropriate level designs and
gaining consensus from peers where appropriate.

Collaborate with software engineers and teams to design and
implement deployment approaches using automated continuous
integration and continuous delivery pipelines.

Design, develop, test, and implement availability, reliability, and
scalability solutions in applications in collaboration with other
software engineers and teams.

Implement infrastructure, configuration, and network as code for
applications and platforms within your remit.

Collaborate with technical experts, key stakeholders, and team
members to resolve complex problems.

Understand service level indicators and utilize service level
objectives to proactively resolve issues before they impact
customers.

Support the adoption of site reliability engineering best practices
within your team.

Lead incident response efforts as a subject matter expert on the
High Performance Computing platform, including restoration of
service, root cause analysis, and engineering preventative measures.

Contribute to client teams by providing resilient architecture
implementations and running chaos simulations to validate platform
resiliency.

Build and maintain standard infrastructure as code modules for reuse
by development teams and other business units.

Participate in architecture resiliency reviews to provide guidance
on cloud design decisions, standards, and operational practices,
while developing skills to attain Subject Matter Expertise in at
least one technical
implementation within a technical domain.

Required qualifications, capabilities, and skills:

Formal training or certification in site reliability engineering
concepts and 5 years of applied experience.

Demonstrate applied experience in contributing to the reliability of
production applications, with proficiency in site reliability
culture and principles, and familiarity with implementing site
reliability within applications or platforms.

Possess 5 years of experience in at least one programming language
such as Python, Java/Spring Boot, or Golang, along with proficient
knowledge of software applications and technical processes within a
given technical discipline, including supporting and delivering
public cloud applications and monitoring technologies like Graphics
Processing Units and IBM Symphony.

Have 5 years of experience in observability practices, including
white and black box monitoring, service level objective alerting,
and telemetry collection using tools such as Grafana, Dynatrace,
Prometheus, Datadog, and Splunk.

Experience with continuous integration and continuous delivery tools
like Jenkins, GitLab, and Spinnaker, as well as hands-on experience
with container and container orchestration technologies such as ECS,
Kubernetes, and Docker, and with troubleshooting common networking
technologies and issues.

Contribute to large and collaborative teams by presenting
information in a logical and timely manner with compelling language
and limited supervision, while proactively recognizing roadblocks
and demonstrating an interest in learning technologies that
facilitate innovation.

Experience working with Arm-based servers and Amazon Machine Images,
and with setting up and configuring OpenTelemetry agents and
collectors, with proficiency in AWS and cloud automation tools and
technologies like
Lambda, CodePipeline, Ansible, and Terraform.

Preferred qualifications, capabilities, and skills:

Ability to contribute to large and collaborative teams by presenting
information in a logical and timely manner with compelling language
and limited supervision.

Ability to identify new technologies and relevant solutions to
ensure design constraints are met by the software team.
Keywords: JPMorgan Chase & Co., Bayonne, Lead Site Reliability Engineer - High Performance Compute, IT / Software / Systems, Jersey City, New Jersey