We're seeking a Senior Site Reliability Engineer who excels on the operational side of DevOps. Attention to detail, proactivity, and problem-solving skills are key, as is the ability to communicate and collaborate effectively.
Job description
Position: Senior SRE Engineer within Platform Operations and Support
• Be a service-minded team player with a quality-driven approach.
• Manage and dispatch incidents and service requests.
• Provide high-quality support, drive troubleshooting and root cause analyses (RCAs), and act as an advisor to development teams.
• Maintain platform availability, shorten time to market for new features, and improve performance.
• Play a crucial role in troubleshooting and quality assurance from an end-to-end perspective.
• Focus on understanding, monitoring, and improving the production system, actively preventing future incidents.
• Be a guiding star for continuous improvement and innovation.
Overview of responsibilities
System support & troubleshooting
• Guide and coordinate junior colleagues within the team.
• Assist in the initial technical analysis of production incidents.
• Support the development team in building capabilities for alerts and monitoring.
• Conduct code reviews for reported cases, and support the development and delivery of fixes.
Infrastructure Automation and Configuration Management
• Develop and maintain automation tools, scripts, and configuration management systems.
• Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.
• Collaborate with development and operations teams to automate build, test, and deployment processes.
Reliability Engineering and Resilience
• Design and implement systems and processes to enhance infrastructure reliability and resilience.
• Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.
System Monitoring and Incident Response
• Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.
• Set up alerts, dashboards, and metrics for proactive detection and response to incidents.
• Investigate and diagnose root causes of incidents and work towards resolution in a timely manner.
Continuous Improvement and Collaboration
• Drive a culture of continuous improvement by identifying areas for automation and efficiency.
• Document procedures, incidents, and best practices for knowledge sharing and team efficiency.
• Stay updated on industry trends and emerging technologies to propose innovative solutions.
• Collaborate closely with cross-functional teams to ensure smooth operation of systems.
Required skills & experience
• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience) with 5+ years of DevOps/SRE work.
• Proficient in scripting/programming languages such as Python and Bash.
• Experience with cloud platforms (AWS preferred).
• Experience with DevOps practices, CI/CD, and monitoring tools.
• Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.
• Strong troubleshooting and problem-solving skills with a keen attention to detail.
• Excellent communication and collaboration skills to work effectively in a cross-functional team environment.
• Strong experience in system administration, infrastructure management, or site reliability engineering.
Additional information specifically for this job request
Additionally, you should have:
• A good general understanding of distributed systems and microservice architecture.
• A solid technical background in IT system development/system administration.
• Software engineering background and/or experience in tool development (e.g., Python, JavaScript, Java, or Kotlin).
• Experience working with Application Performance Monitoring tools (e.g., Prometheus and Grafana).
• Good knowledge of SLAs, SLOs, and SLIs, and how to use metrics to measure service levels and objectives.
• Experience working with centralized logging platforms (e.g., Elastic stack, Splunk, Datadog).
• Experience working with container orchestration (e.g., Kubernetes).
Location: Remote or hybrid (if located in Gothenburg, minimum 3 days on site)
Language: Fluent English
Tech Stack (in a flash)
AWS, AWS CDK, Terraform, Python, GitHub Actions, Bash, MongoDB, Elasticsearch, Fluent Bit, Kibana, Grafana, Kafka, Prometheus, Docker, Kubernetes, Linux, PowerShell, ServiceNow, Atlassian, Opsgenie, GitLab