Agenda

  • DevOps Benefits
  • Our Journey
  • Lessons
  • Current Status
  • Conclusion

Why DevOps?

Software Delivery Performance

Those that develop and deliver quickly are better able to experiment with ways to increase customer adoption and satisfaction, pivot when necessary, and keep up with compliance and regulatory demands.

Software Delivery and Operational Performance

Why DevOps

Organisational Performance

Software Delivery and Operational Performance (SDO)

Software Delivery Performance

Service Operational Performance

Software Delivery Performance

  • Cycle Time (from commit to production)
  • Failure Rate

Service Operational Performance

  • Availability
  • Time to Restore

Performance Distribution

Our Journey

  • First team to adopt cloud (2017)
  • First team to adopt the "You Build It, You Run It" principle
  • Responsible for providing data for reporting
  • Learned by doing, and now sharing...

Operational Models

Normal Scrum/XP setup was not enough

  • Should we just add an Ops person to the team?
  • Are we responsible for every little thing?
  • Compliance? Regulations? Pipelines?

Operational Models

  • Amazon Two Pizza Teams
  • Google SRE

Amazon Two Pizza Teams

  • How many people can you feed with two big (American-size) pizzas?
  • Operational responsibility inside the team
  • Customer contact

Google SRE

  • Site Reliability Engineers
  • Close collaboration with Delivery Teams
  • Veto power on production environments

Tyro Way

Delivery Teams

  • Service Owners
  • You Build It, You Run It principle

Platform Teams

  • Build the rails
  • Facilitate Compliance, Security, Regulatory Work...

New area of knowledge

We are developers, we know how to do it...

New area of knowledge

Define your boundaries

graph LR
  ma[micro-service A] --> mb[micro-service B]
  ma[micro-service A] --> mc[micro-service C]
  mc[micro-service C] --> md[micro-service D]

graph LR
  subgraph Business Capability
    ma[micro-service A] --> mb[micro-service B]
    ma[micro-service A] --> mc[micro-service C]
    mc[micro-service C] --> md[micro-service D]
  end

Define the business capability you will support

Ex: Provide data for reporting

  • We needed deep involvement from the business
  • We had to stop talking about software and talk about services

Defining Service Levels

SLI - Service Level Indicators

A carefully defined quantitative measure of some aspect of the level of service that is provided

  • Examples:
    • Response Time
    • Data Durability
    • Uptime
    • Availability
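
As a rough illustration of how an SLI becomes a number, the sketch below computes availability as the percentage of successful requests in a measurement window. The request-based definition, the function name `availability_sli`, and the sample counts are assumptions made for this example; a time-based definition would be equally valid.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return availability as the percentage of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= successful_requests <= total_requests:
        raise ValueError("successful_requests must be between 0 and total_requests")
    return 100.0 * successful_requests / total_requests


# Hypothetical counts for one measurement window.
print(f"{availability_sli(999_500, 1_000_000):.2f}%")  # prints 99.95%
```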

SLO - Service Level Objectives

A target value or range of values for a service level that is measured by an SLI

  • Examples:
    • Availability: 99.95%
    • Response Time:
      • 95th percentile: <= 500ms
      • 99th percentile: < 1000ms
    • Data Durability: 5 years
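
The response-time objectives above can be checked against measured latencies. This is a minimal sketch that assumes the nearest-rank percentile method; the helper names `percentile` and `meets_response_time_slo` and the sample latencies are hypothetical, and monitoring tools may compute percentiles differently.

```python
def percentile(samples, pct: int):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[rank - 1]


def meets_response_time_slo(latencies_ms) -> bool:
    """Check the example SLO: p95 <= 500ms and p99 < 1000ms."""
    return percentile(latencies_ms, 95) <= 500 and percentile(latencies_ms, 99) < 1000


# Hypothetical latencies: mostly fast, with a small slow tail.
healthy = [200] * 94 + [450] * 5 + [900]
degraded = [200] * 90 + [1200] * 10
print(meets_response_time_slo(healthy))   # True
print(meets_response_time_slo(degraded))  # False
```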

Allowed unavailability window, by availability level

Availability level  per year      per quarter    per month     per week      per day       per hour
90%                 36.5 days     9 days         3 days        16.8 hours    2.4 hours     6 minutes
95%                 18.25 days    4.5 days       1.5 days      8.4 hours     1.2 hours     3 minutes
99%                 3.65 days     21.6 hours     7.2 hours     1.68 hours    14.4 minutes  36 seconds
99.5%               1.83 days     10.8 hours     3.6 hours     50.4 minutes  7.20 minutes  18 seconds
99.9%               8.76 hours    2.16 hours     43.2 minutes  10.1 minutes  1.44 minutes  3.6 seconds
99.95%              4.38 hours    1.08 hours     21.6 minutes  5.04 minutes  43.2 seconds  1.8 seconds
99.99%              52.6 minutes  12.96 minutes  4.32 minutes  60.5 seconds  8.64 seconds  0.36 seconds
99.999%             5.26 minutes  1.30 minutes   25.9 seconds  6.05 seconds  0.87 seconds  0.04 seconds
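
The unavailability windows follow directly from the availability level: the allowed downtime is the window length multiplied by (1 - availability). A small sketch, assuming a 365-day year and a 30-day month (the function name `allowed_downtime` is made up for this example):

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Return how long a service may be unavailable within `window`."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return window * (1 - availability_pct / 100)


year = timedelta(days=365)   # assumption: non-leap year
month = timedelta(days=30)   # assumption: 30-day month

print(f"{allowed_downtime(99.95, year).total_seconds() / 3600:.2f} hours per year")
print(f"{allowed_downtime(99.95, month).total_seconds() / 60:.2f} minutes per month")
```

For 99.95% this gives roughly 4.38 hours per year and 21.6 minutes per month, in line with the corresponding row above.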

SLA - Service Level Agreements

An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain

MAO - Maximum Allowed Outage

Maximum amount of time a system can be unavailable before its loss will compromise the organization's objectives or survival. Also known as the Maximum Tolerable Period of Disruption (MTPOD)

RTO - Recovery Time Objective

Targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity

Security

Least Privilege

Break Glass

Defense in depth

Monitoring

Closed Box

  • How the user sees your service

Open Box

  • Knows the internals of the application
    • Logs
    • JVM
    • Health Check Pages

Monitoring Dependency

graph LR
  APP --> tool1(Monitoring Tool)

Monitoring Dependency

graph LR
  APP -.-> tool1(Monitoring Tool)

Monitoring Dependency

graph LR
  APP -. logs .- tool1(Monitoring Tool)

Dashboards

Alerting

  • Inform humans about unexpected behaviour
  • The service is not able to fulfil an SLO
  • A dependency is malfunctioning

Alerting Channels vs Severity

Severity  Channel
Trivial   Log
Low       Ticket
Medium    E-mail
High      Pager
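
The severity-to-channel mapping can be expressed as a simple lookup. This is only a sketch: the `Severity` enum, the `route_alert` function, and the message format are hypothetical, and a real setup would call the logging, ticketing, e-mail, or paging system instead of returning a string.

```python
from enum import Enum


class Severity(Enum):
    TRIVIAL = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4


# Mapping taken from the table above.
CHANNEL_BY_SEVERITY = {
    Severity.TRIVIAL: "log",
    Severity.LOW: "ticket",
    Severity.MEDIUM: "e-mail",
    Severity.HIGH: "pager",
}


def route_alert(severity: Severity, message: str) -> str:
    """Pick the channel for an alert and return a line describing the routing."""
    channel = CHANNEL_BY_SEVERITY[severity]
    return f"[{channel}] {message}"


print(route_alert(Severity.HIGH, "Availability SLO at risk"))
```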

Alerting Lesson

  • Too much alerting is the same as no alerting

On-call responsibilities

  • Monitor the SLOs
  • Work on Resilience stories
  • Veto power over deploys and releases

Post-mortem

Pre-work

Capture an incident timeline

Hosting Post-mortem session

Apply the retrospective prime directive

Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.

Incident Retro

Post-mortem

  • Impacts our backlog
  • Improves our architecture
  • Improves our application design

Release Strategy

Dark Launch

graph LR
  Client ==> Current
  style Current fill:#f3faff

Dark Launch

graph LR
  Client ==> Current
  New(New)
  style New fill:#6FF4BA
  style Current fill:#f3faff

Dark Launch

graph LR
  Client ==> Current
  Client -.-> New(New)
  Current -. Compare .- New
  style New fill:#6FF4BA
  style Current fill:#f3faff

Dark Launch

graph LR
  Client ==> New(New)
  style New fill:#6FF4BA
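
The compare phase of a dark launch can be sketched as follows: the client always receives the response from the current service, while the same request is also sent to the new service and any difference is recorded. All names here (`dark_launch`, `current_service`, `new_service`) are hypothetical, and the shadow call is made synchronously only to keep the example short; in practice it would likely run asynchronously so it cannot slow down or break the client request.

```python
from typing import Any, Callable, List, Tuple

Handler = Callable[[Any], Any]


def dark_launch(request: Any, current: Handler, new: Handler,
                mismatches: List[Tuple[Any, Any, Any]]) -> Any:
    """Serve from `current`, shadow the request to `new`, and record differences."""
    response = current(request)
    try:
        shadow = new(request)
    except Exception as error:
        # A failure in the new service must never reach the client.
        mismatches.append((request, response, repr(error)))
    else:
        if shadow != response:
            mismatches.append((request, response, shadow))
    return response


# Hypothetical services: the new one disagrees for inputs of 3 or more.
def current_service(amount: int) -> int:
    return amount * 2


def new_service(amount: int) -> int:
    return amount * 2 if amount < 3 else amount * 3


mismatches: List[Tuple[Any, Any, Any]] = []
results = [dark_launch(n, current_service, new_service, mismatches) for n in (1, 2, 3)]
print(results)     # [2, 4, 6]
print(mismatches)  # [(3, 6, 9)]
```

Once the mismatch list stays empty for long enough, traffic can be switched to the new service, as in the final diagram.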

Our current status

Deploys per week

Release Time (from commit to deploy)

Availability

Worst 30 days between November/2018 and December/2018:

99.6%

Conclusion

  • We are happy, there is no turning back
  • Easier than ever to introduce changes in production
  • Enhancing the reporting experience for our clients
  • Spreading the lessons to other teams