使用 flagger 实现 automated canary analysis

项目地址：https://github.com/weaveworks/flagger

flagger 以 prometheus 的 metrics 作为依据，通过自动调整 virtualservice 的流量路由权重实现灰度发布

这里使用 rancher 部署，首先创建 flagger 名称空间，安装 istio、kube-prometheus(prometheus operator)、helm

部署 gateway 资源


apiVersion: networking.istio.io/v1alpha3  
kind: Gateway  
metadata:  
 name: flagger-gateway  
 namespace: flagger  
spec:  
 selector:  
 istio: ingressgateway  
 servers:  
 - hosts:  
 - '*'  
 port:  
 name: http  
 number: 80  
 protocol: HTTP

部署 flagger

添加 charts 仓库


helm repo add flagger https://flagger.app

部署 flagger，指定 istio，指定 prometheus


helm upgrade -i flagger flagger/flagger \
--namespace=cattle-prometheus-p-2p8nx \ # prometheus所在的namespace
--set crd.create=true \
--set meshProvider=istio \
--set metricsServer=http://prometheus-operated:9090

podinfo 的 deployment、hpa 资源


kubectl apply -f podinfo.yml --namespace=flagger

podinfo.yml


apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  labels:
    app: podinfo
spec:
  minReadySeconds: 5
  revisionHistoryLimit: 5
  progressDeadlineSeconds: 60
  strategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
      labels:
        app: podinfo
    spec:
      containers:
      - name: podinfod
        image: stefanprodan/podinfo:3.1.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 9898
          protocol: TCP
        - name: http-metrics
          containerPort: 9797
          protocol: TCP
        - name: grpc
          containerPort: 9999
          protocol: TCP
        command:
        - ./podinfo
        - --port=9898
        - --port-metrics=9797
        - --grpc-port=9999
        - --grpc-service-name=podinfo
        - --level=info
        - --random-delay=false
        - --random-error=false
        livenessProbe:
          exec:
            command:
            - podcli
            - check
            - http
            - localhost:9898/healthz
          initialDelaySeconds: 5
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - podcli
            - check
            - http
            - localhost:9898/readyz
          initialDelaySeconds: 5
          timeoutSeconds: 5
        resources:
          limits:
            cpu: 2000m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 64Mi
---
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 2
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      # scale up if usage is above
      # 99% of the requested CPU (100m)
      targetAverageUtilization: 99

部署 grafana

自带一个 istio-canary 面板，也可以导出面板模板导入其他 grafana 中


helm upgrade -i flagger-grafana flagger/grafana \
--namespace=cattle-prometheus-p-2p8nx \
--set url=http://prometheus-operated:9090

flagger-loadtester 的 deployment、service 资源


kubectl apply -f flagger-loadtester.yml --namespace=flagger

flagger-loadtester.yml


apiVersion: apps/v1
kind: Deployment
metadata:
  name: flagger-loadtester
  labels:
    app: flagger-loadtester
spec:
  selector:
    matchLabels:
      app: flagger-loadtester
  template:
    metadata:
      labels:
        app: flagger-loadtester
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: loadtester
          image: weaveworks/flagger-loadtester:0.11.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 8080
          command:
            - ./loadtester
            - -port=8080
            - -log-level=info
            - -timeout=1h
          livenessProbe:
            exec:
              command:
                - wget
                - --quiet
                - --tries=1
                - --timeout=4
                - --spider
                - http://localhost:8080/healthz
            timeoutSeconds: 5
          readinessProbe:
            exec:
              command:
                - wget
                - --quiet
                - --tries=1
                - --timeout=4
                - --spider
                - http://localhost:8080/healthz
            timeoutSeconds: 5
          resources:
            limits:
              memory: "512Mi"
              cpu: "1000m"
            requests:
              memory: "32Mi"
              cpu: "10m"
          securityContext:
            readOnlyRootFilesystem: true
            runAsUser: 10001
---
apiVersion: v1
kind: Service
metadata:
  name: flagger-loadtester
  labels:
    app: flagger-loadtester
spec:
  type: ClusterIP
  selector:
    app: flagger-loadtester
  ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: http

podinfo 的 canary 资源

rancher 中 istio的ingressgateway 默认 http2 端口 31380，在 slb 添加转发 80=>31380
在 hosts 中添加 app.istio.example.com 映射 slb 的解析
canary 资源创建 podinfo-primary 的 deployment 和 service，podinfo-canary 的 service，istio 的 destinationrules、virtualservice 等


kubectl apply -f podinfo-canary.yml --namespace=flagger

podinfo-canary.yml


apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
  namespace: flagger
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # service port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
    # Istio gateways (optional)
    gateways:
    - mesh    
    - flagger-gateway
    # Istio virtual service host names (optional)
    hosts:
    - podinfo.flagger    
    - app.istio.example.com
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
    - name: istio_requests_total
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      threshold: 99
      interval: 30s
    - name: istio_request_duration_seconds_bucket
      # maximum req duration P99
      # milliseconds
      threshold: 500
      interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.flagger/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        url: http://flagger-loadtester.flagger/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.flagger:9898/"

金丝雀发布

部署新版本镜像


kubectl -n book set image deployment/podinfo \
podinfod=stefanprodan/podinfo:3.1.1

查看变化


watch 'kubectl -n flagger describe canary/podinfo | tail -n 5'

Events:

New revision detected podinfo.flagger
Scaling up podinfo.flagger
Waiting for podinfo.flagger rollout to finish: 0 of 1 updated replicas are available
Advance podinfo.flagger canary weight 5
Advance podinfo.flagger canary weight 10
Advance podinfo.flagger canary weight 15
Advance podinfo.flagger canary weight 20
Advance podinfo.flagger canary weight 25
Advance podinfo.flagger canary weight 30
Advance podinfo.flagger canary weight 35
Advance podinfo.flagger canary weight 40
Advance podinfo.flagger canary weight 45
Advance podinfo.flagger canary weight 50
Copying podinfo.flagger template spec to podinfo-primary.flagger
Waiting for podinfo-primary.flagger rollout to finish: 1 of 2 updated replicas are available
Promotion completed! Scaling down podinfo.flagger

A/B 测试

通过 header 匹配来路由流量


apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
  name: podinfo
  namespace: flagger
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 60
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
    # Istio gateways (optional)
    gateways:
    - flagger-gateway
    # Istio virtual service host names (optional)
    hosts:
    - app.istio.example.com
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
  canaryAnalysis:
    # schedule interval (default 60s)
    interval: 1m
    # total number of iterations
    iterations: 10
    # max number of failed iterations before rollback
    threshold: 2
    # canary match condition
    match:
      - headers:
          user-agent:
            regex: "^(?!.*Chrome).*Safari.*"
      - headers:
          cookie:
            regex: "^(.*?;)?(type=insider)(;.*)?$"
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      threshold: 99
      interval: 1m
    - name: request-duration
      # maximum req duration P99
      # milliseconds
      threshold: 500
      interval: 30s
    # generate traffic during analysis
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.flagger/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 -H 'Cookie: type=insider' http://podinfo.flagger:9898/"

自动回滚

在金丝雀分析期间，可以生成 HTTP 500 错误和高响应延迟，以测试 Flagger 是否暂停升级。
在 loadtest 中执行命令，生成 HTTP 500 错误返回：


watch curl http://podinfo-canary:9898/status/500

生成延迟


watch curl http://podinfo-canary:9898/delay/1

当失败检查的数量达到金丝雀分析阈值时，流量被路由回主版本，金丝雀版本被缩放为 0，并且升级被标记为失败。
金丝雀报错和延迟峰值被记录为 Kubernetes 事件


kubectl -n cattle-prometheus-p-2p8nx logs deployment/flagger -f | jq .msg

Starting canary deployment for podinfo.flagger
Advance podinfo.flagger canary weight 5
Advance podinfo.flagger canary weight 10
Advance podinfo.flagger canary weight 15
Halt podinfo.flagger advancement success rate 69.17% < 99%
Halt podinfo.flagger advancement success rate 61.39% < 99%
Halt podinfo.flagger advancement success rate 55.06% < 99%
Halt podinfo.flagger advancement success rate 47.00% < 99%
Halt podinfo.flagger advancement success rate 37.00% < 99%
Halt podinfo.flagger advancement request duration 1.515s > 500ms
Halt podinfo.flagger advancement request duration 1.600s > 500ms
Halt podinfo.flagger advancement request duration 1.915s > 500ms
Halt podinfo.flagger advancement request duration 2.050s > 500ms
Halt podinfo.flagger advancement request duration 2.515s > 500ms
Rolling back podinfo.flagger failed checks threshold reached 10
Canary failed! Scaling down podinfo.flagger

使用 flagger 实现 automated canary analysis

部署 gateway 资源

部署 flagger

podinfo 的 deployment、hpa 资源

部署 grafana

flagger-loadtester 的 deployment、service 资源

podinfo 的 canary 资源

金丝雀发布

A/B 测试

自动回滚

相关帖子

网页剪裁时出现：请启动思源并确保网络连通后再试

有人用思源写长篇小说吗？

有批量 doc 转 md 的软件吗？没有的话我回头写一个

导出成 Word 文档时选择【合并子文档】，文档结构与直接导出不一致

思源媒体播放器 v0.3.2 更新（alist 网盘支持、一键创建媒体笔记）

自己下载的插件，已经按照文档进行编译了，但还是不能用

思源的引用搜索为什么不搜索有序序列？在设置中勾选了也不行，但改成无序序列就行

欢迎来到这里！