Site Reliability Engineer
We're seeking a Senior Site Reliability Engineer who excels on the operational side of DevOps. Attention to detail, proactivity, and problem-solving skills are key, as is the ability to communicate and collaborate effectively.

Job description
Position: Senior Site Reliability Engineer within Platform Operations and Support
• A service-minded team player with a quality-driven approach.
• Manage and dispatch incidents and service requests.
• Provide high-quality support, drive troubleshooting and root cause analyses (RCAs), and act as an advisor to development teams.
• Be responsible for maintaining platform availability, shortening time to market for new features, and improving performance.
• Play a crucial role in troubleshooting and quality assurance from an end-to-end perspective.
• Focus on understanding, monitoring, and improving the production system, actively preventing future incidents.
• Be a leading star for continuous improvement and innovation.

Overview of responsibilities

System support & troubleshooting
• Guide and coordinate junior colleagues within the team.
• Assist in the initial technical analysis of production incidents.
• Support the development team in building alerting and monitoring capabilities.
• Conduct code reviews for reported cases, and develop and deliver fixes.

Infrastructure Automation and Configuration Management
• Develop and maintain automation tools, scripts, and configuration management systems.
• Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.
• Collaborate with development and operations teams to automate build, test, and deployment processes.

Reliability Engineering and Resilience
• Design and implement systems and processes to enhance infrastructure reliability and resilience.
• Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.

System Monitoring and Incident Response
• Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.
• Set up alerts, dashboards, and metrics for proactive detection of and response to incidents.
• Investigate and diagnose the root causes of incidents and work towards timely resolution.

Continuous Improvement and Collaboration
• Drive a culture of continuous improvement by identifying areas for automation and efficiency.
• Document procedures, incidents, and best practices for knowledge sharing and team efficiency.
• Stay updated on industry trends and emerging technologies to propose innovative solutions.
• Collaborate closely with cross-functional teams to ensure smooth operation of systems.

Required skills & experience
• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience), with 5+ years of DevOps/SRE experience.
• Proficiency in scripting/programming languages such as Python and Bash.
• Experience with cloud platforms (AWS preferred).
• Experience with DevOps practices, CI/CD, and monitoring tools.
• Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.
• Strong troubleshooting and problem-solving skills with keen attention to detail.
• Excellent communication and collaboration skills to work effectively in a cross-functional team environment.
• Strong experience in system administration, infrastructure management, or site reliability engineering.

Additional information specifically for this job request
Additionally, you should have:
• A good general understanding of distributed systems and microservice architecture.
• A solid technical background in IT system development/system administration.
• A software engineering background and/or experience in tool development (e.g., Python, JavaScript, Java, or Kotlin).
• Experience working with Application Performance Monitoring tools (e.g., Prometheus and Grafana).
• Good knowledge of SLAs, SLOs, and SLIs, and of how to use metrics to measure service levels and objectives.
• Experience working with centralized logging platforms (e.g., Elastic Stack, Splunk, Datadog).
• Experience working with container orchestration (e.g., Kubernetes).

Location: Remote or hybrid (if located in Gothenburg, minimum 3 days on site)
Language: Fluent English

Tech Stack (in a flash)
AWS, AWS CDK, Terraform, Python, GitHub Actions, Bash, MongoDB, Elasticsearch, Fluent Bit, Kibana, Grafana, Kafka, Prometheus, Docker, Kubernetes, Linux, PowerShell, ServiceNow, Atlassian, Opsgenie, GitLab

