Lead Site Reliability Engineer
Remote | Köln, Nordrhein-Westfalen, Germany | Engineering
...is Germany's best-known AI company. We develop neural networks to help people work with language. With DeepL Translator, we have created the world's best machine translation system and made it available free of charge to everyone online. Over the next few years, we aim to make DeepL the world's leading language technology company.
Our goal is to overcome language barriers and bring cultures closer together.
What distinguishes us from other companies?
DeepL (formerly Linguee) was founded by developers and researchers. We focus on the development of new, exciting products, which is why we spend a lot of time actively researching the latest topics. We understand the challenges of developing new products and try to meet them with an agile and dynamic way of working. Our work culture is very open because we want our employees to feel comfortable. In our daily work we use modern technologies - not only to translate texts, but also to create the world's best dictionaries and solve other language problems.
When we tell people about DeepL as an employer, reactions are overwhelmingly positive. Maybe it's because they have enjoyed our services, or maybe they just want to get on board with our quest to break down language barriers and facilitate communication.
We are constantly looking for outstanding employees! Currently, we offer remote work in Germany, the Netherlands, the UK, and Poland. Whether you would like to work from home in one of these countries or from one of our offices in Amsterdam, London, Cologne, or Paderborn: the choice is yours. No matter where you choose to work from, our way of working is designed to make you an essential part of the team.
What will you be doing at DeepL? Your Responsibilities
We are looking for a Site Reliability Engineer (SRE) to kickstart our SRE team. As an SRE, you will be the bridge between the Kubernetes platform and software development teams, ensuring the reliability, scalability, and performance of our systems. You will work closely with 1-2 cross-functional software development teams, cultivating a deep partnership to help them deploy and manage their applications on our in-house platform infrastructure, implementing and teaching best practices.
The SRE team's responsibility will be to constantly drive for improvement of our reliability with the goal of taking on-call for all production services.
You will work on standardizing our deployments, mentoring developers and collaborating with our internal developer platform team on improving the Kubernetes-based platform.
The ideal candidate will have a deep passion for collaboration and a background in software development.
- Lead, mentor, and manage a team of Site Reliability Engineers; develop short- and long-term goals together with the overall platform team and communicate clear objectives
- Develop and maintain a deep partnership with 1-2 software development teams to understand their application requirements and to help them deploy and manage their applications on the platform infrastructure
- Work closely with the platform team to ensure that the platform infrastructure provided to the SRE team is reliable, scalable, and meets the requirements of the organization's applications
- Collaborate with the platform team to continuously improve the reliability and performance of the platform infrastructure through automated testing, monitoring, and proactive maintenance
- Develop and maintain the necessary tooling, such as monitoring, alerting, and logging systems, to ensure the health and availability of the platform infrastructure
- Design and track updates to Service Level Objectives (SLO) and Service Level Indicators (SLI) for critical services, ensuring they are met or exceeded
- Troubleshoot issues related to the platform infrastructure, participate in incident response and blameless post-mortems to identify and address the root causes of platform-related incidents, and implement preventative measures to avoid similar issues in the future
What we offer
- We are a distributed workforce, enabling you to work from the comfort of your home office in Germany, Poland, the Netherlands, or the UK, or from one of our comfortable offices
- State-of-the-art equipment for your workplace
- Almost completely open-source technology on the inside - if we run it, we can fix it ourselves
- Operation at scale for products used by more than 100 million people worldwide
- A friendly, international, and highly committed team with a lot of trust, and a concise decision-making process
- Meaningful work: We break down language barriers worldwide and bring different cultures closer together
What we are looking for
- 4+ years of experience as a Site Reliability Engineer, with proven experience in implementing SLOs, SLIs and other best practices from the SRE methodology
- 2+ years of managing infrastructure teams at high-traffic websites or online services at scale
- Experience with either on-premises bare-metal or cloud-based infrastructure that is operated at large scale and globally distributed
- You bridge the gap between developing software and running infrastructure, bringing solid experience in both
- You are an excellent communicator and have the desire to help the teams you are working with
- Advanced hands-on experience with Kubernetes, Prometheus, and Grafana, as well as alerting tools like PagerDuty
- You have a background in software development, optimally in backend applications
- You're not afraid to get your hands dirty - DeepL is scaling rapidly, and there is always something to do. We are working to close the gap between our on-prem infrastructure and the usual cloud provider features, such as managed databases
- Strong knowledge of Kubernetes and general application security, for example, network filtering and mTLS for app-to-app communication
If you don't tick all of the boxes above but feel like you're the right person: Don't worry, give it a shot!
We are looking forward to your application!