SLA (Service Level Agreement)

The information on this page is written for Vamp v0.9.2

SLA stands for “Service Level Agreement”. Vamp uses an SLA to define a set of boundaries for a service and the actions that should take place once the service crosses those boundaries. In essence, an SLA and its associated escalations form a workflow that Vamp checks and controls based on the runtime behaviour of a service. SLAs and escalations are defined with the Vamp DSL.

The SLA event system

You can define an SLA for each cluster in a blueprint. A common example is to check whether the average response time of the cluster (averaged across all services) is above or below a certain threshold. Under the hood, an SLA workflow creates two distinct events, which Vamp sends to and stores in Elasticsearch:

  • Escalate for a specific deployment and cluster
    e.g. if the response time is higher than the upper threshold.
  • DeEscalate for a specific deployment and cluster
    e.g. if the response time is lower than the lower threshold.

SLA monitoring is a continuous background process that runs at a configurable interval. On each run, an SLA workflow is executed for each deployment and cluster that has an SLA defined. Within the same SLA definition it’s possible to define a list of escalations. Escalations are triggered by escalation events (Escalate/DeEscalate).
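
The check performed on each run can be pictured with a minimal Python sketch. This is an illustration of the event model described above, not Vamp's actual implementation: the function names `check_sla` and `run_sla_pass`, the dictionary layout, and the shape of the returned event records are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

ClusterKey = Tuple[str, str]  # (deployment name, cluster name); hypothetical layout


def check_sla(avg_response_ms: float, upper_ms: float, lower_ms: float) -> Optional[str]:
    """Return the event name for one cluster, or None when within bounds."""
    if avg_response_ms > upper_ms:
        return "Escalate"
    if avg_response_ms < lower_ms:
        return "DeEscalate"
    return None


def run_sla_pass(
    averages: Dict[ClusterKey, float],
    slas: Dict[ClusterKey, Dict[str, float]],
) -> List[Dict[str, str]]:
    """One monitoring run: check every (deployment, cluster) that has an SLA."""
    events = []
    for key, sla in slas.items():
        if key not in averages:
            continue  # no metrics available yet for this cluster
        event = check_sla(averages[key], sla["upper"], sla["lower"])
        if event is not None:
            deployment, cluster = key
            events.append({"event": event, "deployment": deployment, "cluster": cluster})
    return events
```

In this sketch a value exactly on a threshold produces no event; whether Vamp treats the boundaries as inclusive is not stated on this page.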

Because escalations react to events rather than to the SLA directly, third-party systems can also generate escalation events by sending them to Elasticsearch. This allows scaling up or down to be triggered by basically any system that can POST a piece of JSON.

SLAs are in essence pieces of code inside Vamp that adhere to this event model and can, if needed, use the metrics and event data streaming out of Elasticsearch to decide how things are running and how they should be running.

SLA types

Vamp currently ships with the following SLA types:

  • response_time_sliding_window

Response time with sliding window

The response_time_sliding_window SLA triggers events based on response times.
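
To make the sliding-window idea concrete, here is a small Python sketch. It is not taken from Vamp's source: the class name `SlidingWindowSla` and its methods are hypothetical, and the parameters simply mirror the `threshold` and `window` settings used in the blueprint example on this page. The cooldown handling follows one reading of the documented rule that a new event is not expected before cooldown + interval after the last event.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class SlidingWindowSla:
    """Illustrative sliding-window response-time check (not Vamp's code)."""

    def __init__(self, upper_ms: float, lower_ms: float,
                 interval_s: float, cooldown_s: float) -> None:
        self.upper_ms = upper_ms
        self.lower_ms = lower_ms
        self.interval_s = interval_s
        self.cooldown_s = cooldown_s
        self.samples: Deque[Tuple[float, float]] = deque()  # (timestamp_s, response_ms)
        self.last_event_at: Optional[float] = None

    def record(self, timestamp_s: float, response_ms: float) -> None:
        """Store one response-time measurement (timestamps must be increasing)."""
        self.samples.append((timestamp_s, response_ms))

    def evaluate(self, now_s: float) -> Optional[str]:
        """Return "Escalate", "DeEscalate", or None for the current window."""
        # Drop samples that have fallen out of the aggregation window.
        while self.samples and self.samples[0][0] <= now_s - self.interval_s:
            self.samples.popleft()
        if not self.samples:
            return None
        # Assumed reading: suppress events until cooldown + interval has passed.
        if (self.last_event_at is not None
                and now_s < self.last_event_at + self.cooldown_s + self.interval_s):
            return None
        average = sum(ms for _, ms in self.samples) / len(self.samples)
        if average > self.upper_ms:
            event = "Escalate"
        elif average < self.lower_ms:
            event = "DeEscalate"
        else:
            return None
        self.last_event_at = now_s
        return event
```

A caller would feed measurements in with `record` and call `evaluate` on each monitoring run; timestamps are plain seconds so the sketch can be exercised without a real clock.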

Example - SLA defined inline in a blueprint.

Notice the SLA is defined per cluster and acts on the first service in the cluster.

Notice how the SLA is defined separately from the escalations. This is key to how Vamp approaches SLAs and to how modular and extensible the system is.

---
name: sava

gateways:
  80: sava/webport

clusters:

  sava:                        # the sava cluster
    services:
      breed:
        name: monarch
        deployable: vamp/monarch
        ports:
          webport: 80

      scale:
        cpu: 1
        memory: 1024MB
        instances: 2

    sla:                        # SLA applies to the first service in the sava cluster (monarch)
      # Type of SLA.
      type: response_time_sliding_window
      threshold:
        upper: 1000   # Upper threshold in milliseconds.
        lower: 100    # Lower threshold in milliseconds.
      window:
        interval: 600 # Time period in seconds used for
                      # average response time aggregation.
        cooldown: 600 # Time period in seconds. During this
                      # period no new escalation events are
                      # generated. A new event can be expected
                      # no sooner than cooldown + interval
                      # after the last event.
     
      # List of escalations.
      escalations:
        - 
          type: scale_instances
          minimum: 1
          maximum: 3
          scale_by: 1
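
The effect of the scale_instances escalation in this blueprint can be sketched in Python as follows. The behaviour shown (add or subtract `scale_by` and keep the result between `minimum` and `maximum`) is inferred from the parameter names and is an assumption, not a description of Vamp's internals; the function `scale_instances` below is a hypothetical helper written for this example.

```python
def scale_instances(current: int, event: str,
                    minimum: int, maximum: int, scale_by: int) -> int:
    """Return the new instance count after an Escalate or DeEscalate event."""
    if event == "Escalate":
        target = current + scale_by
    elif event == "DeEscalate":
        target = current - scale_by
    else:
        raise ValueError(f"unknown escalation event: {event!r}")
    # Assumed behaviour: never leave the configured [minimum, maximum] range.
    return max(minimum, min(maximum, target))
```

With the values from the blueprint (minimum 1, maximum 3, scale_by 1) and the initial 2 instances, an Escalate event would give 3 instances under this reading, and a further Escalate would leave the count at 3.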

What next?