# Docker Container Resource Limits: Best Practices

When running containers in production, sensible resource limits are key to keeping the system stable. I have hit plenty of pitfalls in containerization projects, from OOM kills to CPU throttling, and collected some practical lessons along the way.
## Memory Limit Strategy

### Basic memory configuration

```yaml
version: "3.8"
services:
  web-app:
    image: myapp:latest
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
    mem_swappiness: 0
```
### Estimating application memory needs

```bash
# Start a test instance and take a baseline snapshot
docker run -d --name myapp-test myapp:latest
docker stats myapp-test --no-stream

# Sample memory usage once a minute for an hour
for i in {1..60}; do
  docker stats myapp-test --no-stream --format "table {{.MemUsage}}" | tail -1
  sleep 60
done

# Drive realistic load while sampling
ab -n 10000 -c 100 http://localhost:8080/api/users
```
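Once the samples are collected, a limit can be derived mechanically. A minimal sketch, assuming the sampling loop above was redirected into a hypothetical `mem_samples.txt` with one MiB value per line; the 30% headroom factor is an assumption, not a universal rule:

```shell
# Pick the peak sample and add ~30% headroom for a suggested memory limit.
# mem_samples.txt is a hypothetical file: one numeric MiB value per line.
awk 'BEGIN { peak = 0 }
     { if ($1 > peak) peak = $1 }
     END { printf "peak=%dMiB suggested_limit=%dMiB\n", peak, peak * 1.3 }' mem_samples.txt
```

The suggested value then maps onto `limits.memory`, with `reservations.memory` set closer to typical usage.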
### Case study: tuning Java application memory

**Problem**: a Java application container was repeatedly OOM-killed.

**Analysis**:

```bash
# Check the kernel log for OOM kill events
dmesg | grep -i "killed process"

# Inspect the heap object histogram inside the container
docker exec myapp-java jmap -histo 1 | head -20
```

**Solution**:

```dockerfile
FROM openjdk:11-jre-slim

# Cap the heap below the container limit; note that an explicit -Xmx
# takes precedence over MaxRAMPercentage when both are set
ENV JAVA_OPTS="-Xms256m -Xmx358m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
ENV JAVA_OPTS="$JAVA_OPTS -XX:+UseContainerSupport -XX:MaxRAMPercentage=70.0"

COPY app.jar /app.jar
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar /app.jar"]
```
## CPU Limits and Scheduling

### CPU resource configuration

```yaml
services:
  compute-app:
    image: compute-intensive:latest
    deploy:
      resources:
        limits:
          cpus: "2.0"
        reservations:
          cpus: "0.5"
    cpuset: "0,1"
```

### Monitoring CPU throttling

```bash
# cgroup v1: per-container CFS throttling counters
cat /sys/fs/cgroup/cpu/docker/[container_id]/cpu.stat

# throttling_ratio = nr_throttled / nr_periods * 100%
```
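The ratio in the formula above can be computed directly from the counters. A sketch, assuming a cgroup v1 `cpu.stat` dump (the file contains `nr_periods`, `nr_throttled`, and `throttled_time` lines):

```shell
# Compute throttling_ratio = nr_throttled / nr_periods * 100%
cat /sys/fs/cgroup/cpu/docker/[container_id]/cpu.stat |
awk '/^nr_periods/   { p = $2 }
     /^nr_throttled/ { t = $2 }
     END { if (p > 0) printf "throttling_ratio=%.1f%%\n", t / p * 100 }'
```

A sustained ratio above roughly 10% usually means the CPU limit is too tight for the workload.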
### Case study: optimizing CPU for a Go service

**Symptom**: the Go service's response times jittered badly, with P99 latency occasionally exceeding 5 seconds.

**Findings**:

```bash
# Heavy CFS throttling showed up in the container's cpu.stat
docker exec myapp cat /sys/fs/cgroup/cpu/cpu.stat
```

**Fix**:

```yaml
# Before: 50ms of quota per 100ms period, i.e. 0.5 CPU
cpu_quota: 50000
cpu_period: 100000

# After: 200ms of quota per 100ms period, i.e. 2 CPUs
cpu_quota: 200000
cpu_period: 100000

# Compose shorthand: cpus = cpu_quota / cpu_period
cpus: 1.0
```
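The relationship between the two knobs is simply `cpus = cpu_quota / cpu_period`; a quick sanity check of the values above:

```shell
# 200000us of quota per 100000us period = 2 CPUs of runtime per period
awk 'BEGIN { printf "cpus=%.1f\n", 200000 / 100000 }'
# → cpus=2.0
```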
## Disk IO Limits

### Storage configuration

```yaml
services:
  database:
    image: postgres:13
    deploy:
      resources:
        limits:
          memory: 2G
    volumes:
      - db_data:/var/lib/postgresql/data
    blkio_config:
      device_read_bps:
        - path: /dev/sda
          rate: "50mb"
      device_write_bps:
        - path: /dev/sda
          rate: "30mb"
      device_read_iops:
        - path: /dev/sda
          rate: 3000
      device_write_iops:
        - path: /dev/sda
          rate: 2000
```

### Monitoring IO performance

```bash
# Per-device IO stats from inside the container
docker exec myapp iostat -x 1

# cgroup v1 blkio throttling counters on the host
cat /sys/fs/cgroup/blkio/docker/[container_id]/blkio.throttle.io_service_bytes
```
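The per-device counters can be summed into readable totals. A sketch, assuming the `blkio.throttle.io_service_bytes` output was saved to a hypothetical `blkio_dump.txt` (lines look like `8:0 Read 1048576`):

```shell
# Sum Read/Write byte counters across devices and report in MiB
awk '$2 == "Read"  { r += $3 }
     $2 == "Write" { w += $3 }
     END { printf "read=%dMiB write=%dMiB\n", r / 1048576, w / 1048576 }' blkio_dump.txt
```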
## Network Resource Management

### Bandwidth limits

```bash
# Shape egress with an HTB qdisc (requires NET_ADMIN in the container)
docker exec myapp tc qdisc add dev eth0 root handle 1: htb default 12
docker exec myapp tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
docker exec myapp tc class add dev eth0 parent 1:1 classid 1:12 htb rate 100mbit
```

### Connection limits

```yaml
services:
  nginx:
    image: nginx:alpine
    deploy:
      resources:
        limits:
          memory: 128M
    sysctls:
      - net.core.somaxconn=65535
      - net.ipv4.ip_local_port_range=10000 65000
    ulimits:
      nofile:
        soft: 65535
        hard: 65535
```
## Container Resource Monitoring

### Collecting Prometheus metrics with cAdvisor

```yaml
version: "3.8"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    command:
      - "--housekeeping_interval=10s"
      - "--docker_only=true"
```

### Key metrics to watch

```yaml
container_monitoring:
  memory_usage:
    query: "container_memory_usage_bytes / container_spec_memory_limit_bytes"
    threshold: 0.85
  cpu_throttling:
    query: "rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_cfs_periods_total[5m])"
    threshold: 0.1
  oom_kills:
    query: "increase(container_oom_kills_total[5m])"
    threshold: 0
```
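The same queries can be wired into Prometheus alerting rules. A sketch, assuming a standard Prometheus rule file; the alert names and `for` durations are assumptions to adapt:

```yaml
groups:
  - name: container-resources
    rules:
      - alert: ContainerMemoryHigh
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
      - alert: ContainerCpuThrottled
        expr: rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
      - alert: ContainerOomKilled
        expr: increase(container_oom_kills_total[5m]) > 0
        labels:
          severity: critical
```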
## Testing Resource Limits

### Memory stress test

```bash
# Allocate 150M inside a 100M-limited container; this should trigger an OOM kill
docker run --rm -it --memory=100m progrium/stress \
  --vm 1 --vm-bytes 150M --vm-hang 0
```

### CPU stress test

```bash
# Run two busy workers against half a CPU; expect heavy throttling
docker run --rm -it --cpus="0.5" progrium/stress \
  --cpu 2 --timeout 60s
```
## Production Best Practices

### Resource quota templates

```yaml
small_service_template: &small_service
  deploy:
    resources:
      limits:
        memory: 256M
        cpus: "0.5"
      reservations:
        memory: 128M
        cpus: "0.25"

medium_service_template: &medium_service
  deploy:
    resources:
      limits:
        memory: 1G
        cpus: "1.0"
      reservations:
        memory: 512M
        cpus: "0.5"

large_service_template: &large_service
  deploy:
    resources:
      limits:
        memory: 4G
        cpus: "2.0"
      reservations:
        memory: 2G
        cpus: "1.0"
```

### Scaling and rollout configuration

```yaml
services:
  web-app:
    image: myapp:latest
    <<: *medium_service
    # Note: YAML merge keys are overridden by explicit keys, so this
    # explicit deploy block replaces the template's; restate the resource
    # limits here if both are needed
    deploy:
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
```
### Failure recovery strategy

```bash
#!/bin/bash
# Watchdog: restart a container when it stops or its memory usage
# crosses a threshold
CONTAINER_NAME=$1
MEMORY_THRESHOLD=90

while true; do
  # Restart if the container is no longer running
  if ! docker ps --format '{{.Names}}' | grep -qx "$CONTAINER_NAME"; then
    echo "Container $CONTAINER_NAME is not running, restarting..."
    docker-compose up -d "$CONTAINER_NAME"
  fi

  # Current memory usage as a percentage, with the trailing % stripped
  MEMORY_USAGE=$(docker stats "$CONTAINER_NAME" --no-stream --format "{{.MemPerc}}" | sed 's/%//')

  if (( $(echo "$MEMORY_USAGE > $MEMORY_THRESHOLD" | bc -l) )); then
    echo "High memory usage detected: ${MEMORY_USAGE}%, restarting container..."
    docker-compose restart "$CONTAINER_NAME"
  fi

  sleep 60
done
```
Resource limits are not something you set once and forget; they need continuous tuning against the application's characteristics and real business load. Remember: it is better to be a bit conservative than to let a container destabilize the host. The same practices carry over to orchestration platforms such as Kubernetes, with only minor differences in configuration syntax.