Site Reliability Engineers (SREs) are responsible for keeping all Leverege production systems running smoothly and meeting SLAs for customer projects. A good SRE must apply sound engineering principles to enhance automation for all deployments, manage monitoring and alerting systems, and stay on top of potential issues due to scale, security vulnerabilities, and infrastructure decisions. Experience with cloud-native software (e.g. Docker, Kubernetes) and general knowledge of networking and distributed systems are a must.
Leverege powers multiple large-scale, business-critical solutions for Fortune 500 companies, with over a million devices already connected to the platform. We are a key technology partner of both Google and AWS and currently power the largest Low-Power Wide-Area Network (LPWAN) IoT solution in North America. Leverege stays up to date with the latest technologies to improve the scale and reliability of our IoT platform and to let developers continually build new IoT solutions.
- Manage OpsGenie rotation and respond to incidents to meet SLAs for our customers
- Run the infrastructure with Terraform, Kubernetes, and Helm on GCP and AWS
- Improve monitoring and alerting systems to catch incidents and reduce false positives
- Apply SRE best practices when documenting and improving infrastructure
- Build internal tools to manage multiple customer projects
- Debug production issues across the entire tech stack (e.g. VMs, containers, cloud, front end)
- Grow the CI/CD pipeline at Leverege
- Design, build, and maintain core infrastructure pieces
- 2+ years of experience in a fast-paced professional setting (startup experience is a plus)
- 2+ years of experience with any of the cloud providers (AWS and GCP are a plus)
- 1+ year of experience with Docker and Kubernetes
- Familiarity with cloud-native tools such as Prometheus, Grafana, Helm, ChartMuseum, and Istio
- Strong debugging skills on distributed systems