Docker Container Resource Limits: Best Practices

When running containers in production, sensible resource limits are key to keeping the system stable. I have hit plenty of pitfalls during containerization projects, from OOM kills to CPU throttling, and this post collects the practical lessons learned along the way.

Memory Limit Strategies

Basic Memory Configuration

# docker-compose.yml
version: "3.8"
services:
  web-app:
    image: myapp:latest
    mem_swappiness: 0        # discourage swapping (service-level option)
    deploy:
      resources:
        limits:
          memory: 512M       # hard limit
        reservations:
          memory: 256M       # reserved memory
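
For quick experiments, the same limits can be expressed directly as docker run flags; a minimal sketch (the container name and image are the ones used above):

# Equivalent docker run flags:
#   --memory             hard limit (limits.memory)
#   --memory-reservation soft reservation
#   --memory-swappiness  swapping preference
#   --memory-swap        set equal to --memory so the container cannot use swap at all
docker run -d --name web-app \
  --memory=512m --memory-reservation=256m \
  --memory-swappiness=0 --memory-swap=512m \
  myapp:latest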

Estimating Application Memory Requirements

# 1. Run the container and watch its memory usage
docker run -d --name myapp-test myapp:latest
docker stats myapp-test --no-stream

# 2. Track the memory growth trend (one sample per minute for an hour)
for i in {1..60}; do
  docker stats myapp-test --no-stream --format "table {{.MemUsage}}" | tail -1
  sleep 60
done

# 3. Run a load test against the service
ab -n 10000 -c 100 http://localhost:8080/api/users

Case Study: Memory Tuning for a Java Application

Problem: the Java application container was repeatedly OOM-killed.

Analysis

# Look for OOM events on the host
dmesg | grep -i "killed process"
# java invoked oom-killer: gfp_mask=0x14000c0, order=0

# Inspect the heap inside the container (PID 1 is the JVM)
docker exec myapp-java jmap -histo 1 | head -20

Solution

FROM openjdk:11-jre-slim

# Cap the JVM heap at roughly 70% of the container memory limit (358m ≈ 70% of 512M)
ENV JAVA_OPTS="-Xms256m -Xmx358m -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

# Make the JVM container-aware; note that an explicit -Xmx takes precedence over
# MaxRAMPercentage, so in practice pick one of the two approaches
ENV JAVA_OPTS="$JAVA_OPTS -XX:+UseContainerSupport -XX:MaxRAMPercentage=70.0"

COPY app.jar /app.jar
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar /app.jar"]
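
For the numbers above to line up, the container itself needs a 512M limit so that the 358m heap leaves headroom for metaspace, thread stacks, and native memory; a minimal sketch (container name is illustrative):

# 358m heap assumes a 512M container limit
docker run -d --name myapp-java --memory=512m myapp:latest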

CPU Limits and Scheduling

CPU Resource Configuration

services:
  compute-app:
    image: compute-intensive:latest
    cpuset: "0,1"            # pin the container to specific CPU cores (service-level option)
    deploy:
      resources:
        limits:
          cpus: "2.0"        # use at most 2 CPUs
        reservations:
          cpus: "0.5"        # reserve 0.5 CPU
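
For one-off testing, the same constraints map onto docker run flags (service and image names are the ones above):

# --cpus caps CPU time via the CFS quota; --cpuset-cpus pins the container to cores 0 and 1
docker run -d --name compute-app --cpus="2.0" --cpuset-cpus="0,1" compute-intensive:latest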

Monitoring CPU Throttling

# Check CPU throttling stats (cgroup v1 path)
cat /sys/fs/cgroup/cpu/docker/[container_id]/cpu.stat
# nr_periods:     number of enforcement periods
# nr_throttled:   periods in which the container was throttled
# throttled_time: total time spent throttled (nanoseconds)

# Throttling ratio, as a percentage:
# throttling_ratio = nr_throttled / nr_periods * 100
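
A small sketch that computes this ratio directly from cpu.stat (the container ID is a placeholder; on cgroup v2 hosts the file is the cpu.stat inside the container's cgroup directory instead):

# Compute the throttling percentage from cpu.stat (cgroup v1 layout shown above)
awk '/nr_periods/ {p=$2} /nr_throttled/ {t=$2} END {if (p > 0) printf "throttled: %.2f%%\n", t/p*100}' \
  /sys/fs/cgroup/cpu/docker/[container_id]/cpu.stat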

Case Study: CPU Optimization for a Go Service

Symptom: the Go service showed severe response-time jitter, with P99 latency occasionally exceeding 5 seconds.

Investigation

# CPU usage looked normal (~50%), but throttling was severe
docker exec myapp cat /sys/fs/cgroup/cpu/cpu.stat
# nr_throttled: 50000
# throttled_time: 180000000000   # 180 s (about 3 minutes) of accumulated throttled time, in ns

Optimization

# Original settings - the quota was too tight
cpu_quota: 50000       # 0.5 CPU
cpu_period: 100000

# Tuned settings - allow bursts of up to 2 CPUs during spikes
cpu_quota: 200000      # 2.0 CPU
cpu_period: 100000
# cpus: 1.0            # the Compose "cpus" shorthand just sets quota/period for you,
#                      # so use it instead of cpu_quota/cpu_period, not alongside them

Disk I/O Limits

Storage Configuration

services:
  database:
    image: postgres:13
    deploy:
      resources:
        limits:
          memory: 2G
    volumes:
      - db_data:/var/lib/postgresql/data
    blkio_config:
      device_read_bps:
        - path: /dev/sda
          rate: "50mb"     # cap reads at 50MB/s
      device_write_bps:
        - path: /dev/sda
          rate: "30mb"     # cap writes at 30MB/s
      device_read_iops:
        - path: /dev/sda
          rate: 3000       # read IOPS cap
      device_write_iops:
        - path: /dev/sda
          rate: 2000       # write IOPS cap

I/O Performance Monitoring

# Watch I/O activity from inside the container (requires iostat/sysstat in the image)
docker exec myapp iostat -x 1

# Per-container I/O statistics from the blkio cgroup (cgroup v1 path)
cat /sys/fs/cgroup/blkio/docker/[container_id]/blkio.throttle.io_service_bytes
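
On cgroup v2 hosts the blkio hierarchy no longer exists; a hedged sketch of where to look instead (the exact directory depends on your cgroup driver, the path below assumes the systemd driver):

# cgroup v2: per-container I/O counters live in io.stat inside the container's cgroup
cat /sys/fs/cgroup/system.slice/docker-"$(docker inspect --format '{{.Id}}' myapp)".scope/io.stat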

Network Resource Management

Bandwidth Limits

# Use tc (traffic control) to shape the container's network bandwidth
# Limit the container's eth0 interface to 100Mbps
# (requires the iproute2 tools and the NET_ADMIN capability inside the container)
docker exec myapp tc qdisc add dev eth0 root handle 1: htb default 12
docker exec myapp tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit
docker exec myapp tc class add dev eth0 parent 1:1 classid 1:12 htb rate 100mbit
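
To confirm the shaping is in place, or to remove it again when the test is over:

# Inspect qdisc/class statistics, then delete the shaping
docker exec myapp tc -s qdisc show dev eth0
docker exec myapp tc -s class show dev eth0
docker exec myapp tc qdisc del dev eth0 root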

Connection Limits

services:
  nginx:
    image: nginx:alpine
    deploy:
      resources:
        limits:
          memory: 128M
    sysctls:
      - net.core.somaxconn=65535                    # larger accept queue
      - net.ipv4.ip_local_port_range=10000 65000    # wider ephemeral port range
    ulimits:
      nofile:
        soft: 65535
        hard: 65535
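
A quick way to confirm that the sysctls and the file-descriptor limit actually took effect inside the running container:

# Verify the tuned values from inside the container
docker exec nginx cat /proc/sys/net/core/somaxconn /proc/sys/net/ipv4/ip_local_port_range
docker exec nginx sh -c 'ulimit -n'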

Container Resource Monitoring

Collecting Metrics with Prometheus

# docker-compose-monitoring.yml
version: "3.8"
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    command:
      - "--housekeeping_interval=10s"
      - "--docker_only=true"

Key Monitoring Metrics

# Monitoring rules (pseudo-config: PromQL query plus alert threshold)
container_monitoring:
  memory_usage:
    query: "container_memory_usage_bytes / container_spec_memory_limit_bytes"
    threshold: 0.85

  cpu_throttling:
    query: "rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_cfs_periods_total[5m])"
    threshold: 0.1

  oom_kills:
    query: "increase(container_oom_kills_total[5m])"
    threshold: 0
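
As a concrete example, the memory threshold above can be written as a Prometheus alerting rule; a minimal sketch (the alert name, group name, and labels are illustrative):

# alert_rules.yml - fire when a container uses more than 85% of its memory limit for 5 minutes
groups:
  - name: container-resources
    rules:
      - alert: ContainerMemoryHigh
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} memory usage above 85% of its limit"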

Testing Resource Limits

Memory Stress Testing

# Use the stress tool to test the memory limit
docker run --rm -it --memory=100m progrium/stress \
  --vm 1 --vm-bytes 150M --vm-hang 0

# Expected result: the container is OOM-killed
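
If you run the test without --rm and give the container a name (the name stress-test below is illustrative), you can confirm afterwards that the kernel OOM killer terminated it; exit code 137 means SIGKILL:

# Check the exit code and the OOMKilled flag once the container has stopped
docker inspect --format 'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}}' stress-test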

CPU Stress Testing

# Test the CPU limit
docker run --rm -it --cpus="0.5" progrium/stress \
  --cpu 2 --timeout 60s

# CPU usage should not exceed 50% of one core
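
While the stress container is running, watch the limit being enforced from another terminal:

# CPUPerc should hover around 50% even though two stress workers are busy-looping
docker stats --no-stream --format "{{.Name}}: {{.CPUPerc}}"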

Production Best Practices

Resource Quota Templates

# Define the templates as Compose extension fields (x- prefix) so the parser ignores them.
# Anchoring the resources block itself lets each service reference it under its own deploy:
# section; merging a whole deploy: block at the service level gets silently overridden by
# a locally declared deploy: key.

# Small services (API gateways, config services)
x-small-service: &small_service
  limits:
    memory: 256M
    cpus: "0.5"
  reservations:
    memory: 128M
    cpus: "0.25"

# Medium services (typical business services)
x-medium-service: &medium_service
  limits:
    memory: 1G
    cpus: "1.0"
  reservations:
    memory: 512M
    cpus: "0.5"

# Large services (data processing)
x-large-service: &large_service
  limits:
    memory: 4G
    cpus: "2.0"
  reservations:
    memory: 2G
    cpus: "1.0"

Scaling and Deployment Configuration

services:
  web-app:
    image: myapp:latest
    deploy:
      resources: *medium_service   # reuse the medium-service quota template
      replicas: 2
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]   # the image must ship curl
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

Failure Recovery Strategy

#!/bin/bash
# container_health_check.sh - restart a container when it stops or its memory usage gets too high

CONTAINER_NAME=$1
MEMORY_THRESHOLD=90   # memory usage threshold in percent

while true; do
  # Restart the container if it is no longer running
  if ! docker ps --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}$"; then
    echo "Container $CONTAINER_NAME is not running, restarting..."
    docker-compose up -d "$CONTAINER_NAME"
  fi

  # Check the memory usage percentage
  MEMORY_USAGE=$(docker stats "$CONTAINER_NAME" --no-stream --format "{{.MemPerc}}" | sed 's/%//')

  if (( $(echo "$MEMORY_USAGE > $MEMORY_THRESHOLD" | bc -l) )); then
    echo "High memory usage detected: ${MEMORY_USAGE}%, restarting container..."
    docker-compose restart "$CONTAINER_NAME"
  fi

  sleep 60
done
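
The script is meant to run in the background on the host, from the directory that contains the compose file; a usage sketch (the service name web-app and log path are illustrative):

# Run the watchdog in the background and append its output to a log file
chmod +x container_health_check.sh
nohup ./container_health_check.sh web-app >> /var/log/container_health_check.log 2>&1 &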

Container resource limits are not a set-once-and-forget exercise; they need continuous tuning based on the application's characteristics and the business load. Remember: it is better to err on the conservative side than to let a container jeopardize the stability of the host. The same practices apply on orchestration platforms such as Kubernetes; only the configuration syntax differs slightly.