Service Reliability Engineering (SRE) Lead
Arlington, Virginia, United States, Remote
Excella is a leading provider of Agile software development and data and analytics solutions to clients in the federal, commercial and non-profit sectors. We believe that great work leads to great things; our experts measure success by the positive impact we make on our clients, community, and colleagues. We are growing fast and need passionate, innovative people who love working with technology and are ready to make an impact. Here's what you can expect from us:
- Workplace sites look different for everyone – whether it’s your home or the office, we believe in a flexible work/life balance that supports you regardless of your location. We offer a home office allowance that can be used for home office furniture/equipment, a daily pass for a coworking space, etc. Our commute reimbursement plan has you covered whether you bike, Metro, or drive to work.
- We offer top of industry medical, dental, and vision benefits with multiple options to choose from such as an employer-contributed health savings account, infertility coverage, and orthodontia so you can select the plan that works best for you.
- Regardless of what stage of life you’re in, Excella wants to support you. We provide 8 weeks of Parental Leave, discounted pet insurance, and a Care.com membership with 3 back-up emergency child or elder care days annually – all available to you on your first day.
- Starting day one, every employee is bonus eligible and receives 15 days of paid vacation, 6 federal holidays, and 4 floating holidays.
- Diversity and inclusion matter. Excella created and continues to support employee-led affinity groups and the Inclusion Diversity Equity Ambassador (IDEA) team, a cross-functional employee-led initiative to continually foster innovation and increase inclusion within Excella.
- We have a "bring your own device" workplace and will share the cost of a new computer of your choice -- Mac or PC. It's up to you.
- We'll invest in your career by providing 3 days of paid professional development every year, including travel and registration fees to attend classes and conferences.
- We encourage mindfulness and overall well-being through employee wellness events, a Headspace membership, and access to Talkspace and mental health coverage through our medical plans.
Overview:
Engineering Leaders at Excella are consultants and thought leaders with great business and technology skills, who are responsible for providing expertise to the teams as we deliver game-changing products and applications for commercial and government clients. The SRE Lead is part of a senior team leading all capabilities Excella offers to clients.
You might be the right person for this role if:
- Automation is always front-of-mind for you: you’re always on the prowl to reduce the number of manual steps and write automation so computers can do what computers are good at and humans can do what humans are good at. When you’re automating tasks, you think about how to automate something and what not to automate.
- You love helping people with different perspectives work together to deliver highly resilient services to production quickly and frequently.
- You recognize the relationship between the pace of innovation and product stability and establish target availability and error budgets to manage the tradeoffs across development and SRE teams.
- You have the curiosity to understand why systems and services behave the way they do in complex technical environments.
- You’re passionate about enabling and running systems that align technology to business outcomes.
- You’re experienced in making systems humane and sustainable for everyone involved.
Responsibilities:
The SRE Lead is responsible for shaping the scope and expertise for SRE practices at Excella. You’ll be responsible for building reliability and resiliency into cloud-based and hybrid infrastructure, tools, services, and processes in collaboration with our development team, and for establishing support and operations practices that keep services highly available to our clients, easily supportable by our developers, and operable for the company.
SRE Lead responsibilities include:
- Working with clients to establish and lead SRE teams to ensure the solutions we deliver consistently meet target availability levels.
- Working closely with development teams to create resilient systems that are able to run and repair themselves with minimal human interaction.
- Evolving incident management processes and tools to respond swiftly to critical incidents, provide transparent communications on incident status, introduce playbooks where necessary to reduce MTTR, and conduct incident reviews for continuous improvement.
- Evolving release engineering practices to implement progressive rollouts with the ability to detect issues and remediate quickly and safely when required.
- Establishing and implementing observability to provide visibility into system health and availability.
- Leading capacity planning efforts to create accurate demand forecasts and conduct regular load testing.
- Working with account leadership to develop and manage effective and sustainable on-call rotations.
- Maintaining, monitoring, and reporting key performance indicators.
Qualifications:
You will have the following experience, including using, supporting, administering, and leveraging tools in the areas listed below:
- 5+ years of hands-on technical delivery experience, with preference given to recent experience in an SRE model
- 3+ years of experience leading technical teams
- Operational experience administering and managing fleets of Linux and Windows servers
- Amazon Web Services (AWS): S3, EC2, Lambda, CloudFront, ELB, EKS
- Container technology and orchestration: Kubernetes and Docker
- CI/CD tools, such as Jenkins, TeamCity, GitLab, Bamboo, TravisCI, or CircleCI
- Monitoring tools: e.g. New Relic, Splunk, ELK
- Incident management tools: e.g. VictorOps, OpsGenie, PagerDuty
- Configuration management tools: e.g. Chef, Puppet, Ansible
- Version control tools, such as GitHub or BitBucket
- Integration of testing tools and services, such as Selenium, Cucumber, JUnit, and JMeter
- Security testing/compliance integrations: Nexus, Chef Compliance
- Scripting languages: e.g. Python, Bash, Ruby, Make
- Managing and operating SQL and NoSQL databases such as PostgreSQL and MongoDB
- Artifact management: Artifactory, Yum/Apt, Chocolatey, Nexus
Excella is an equal opportunity/affirmative action employer. All qualified applicants will receive consideration for employment without regard to sex, gender identity, sexual orientation, race, color, religion, national origin, disability, protected veteran status, age, or any other characteristic protected by law.