Amazon OpenSearch Serverless Zstd 压缩实测：LZ4 vs Zstandard 索引压缩全面对比¶

Lab 信息

难度: ⭐⭐ 中级
预估时间: 45 分钟
预估费用: $2-5（Serverless OCU 按时计费）
Region: us-east-1
最后验证: 2026-04-10

背景¶

OpenSearch Serverless 的存储成本一直是用户最关注的话题之一。默认的 LZ4 压缩算法虽然解压速度极快，但压缩率并非最优。2026 年 4 月，AWS 宣布 OpenSearch Serverless 支持 Zstandard (zstd) 索引压缩，官方宣称可减少最高 32% 的索引大小。

但关键问题是：压缩率提升的代价是什么？ 查询会变慢吗？写入吞吐会降多少？不同压缩级别的边际效用如何？

本文通过在同一个 Serverless collection 中创建不同 codec 的索引、写入完全相同的数据集，拿到第一手对比数据。

前置条件¶

AWS 账号（需要 aoss:* 权限）
AWS CLI v2 已配置
awscurl（用于 SigV4 签名的 HTTP 请求）
Python 3（用于生成测试数据）

核心概念¶

三种编解码器对比¶

特性	LZ4 (默认)	zstd	zstd_no_dict
压缩算法	LZ4	Zstandard + 字典	Zstandard 无字典
压缩率	基线	最高	中等
解压速度	最快	较慢	中等
压缩级别	不可配置	1-6（默认 3）	1-6（默认 3）
设置方式	不需设置（默认）	`index.codec: zstd`	`index.codec: zstd_no_dict`

关键限制¶

index.codec 是静态设置 — 创建索引后不可更改。Serverless 不支持 close/reopen，因此完全不可变
仅影响 stored fields — 压缩只作用于段内最大的数据结构（stored fields），不影响倒排索引和 BKD 树
compression_level 必须配合 codec 使用 — 不能对默认 LZ4 设置压缩级别

动手实践¶

Step 1: 创建 Serverless Collection¶

首先创建 Encryption、Network、Data Access 三个策略，然后创建 TIMESERIES collection。

创建加密策略：

cat > /tmp/enc-policy.json << 'EOF'
{
  "Rules": [{"ResourceType": "collection", "Resource": ["collection/zstd-compression-test"]}],
  "AWSOwnedKey": true
}
EOF

aws opensearchserverless create-security-policy \
  --name zstd-test-enc \
  --type encryption \
  --policy file:///tmp/enc-policy.json \
  --region us-east-1

创建网络策略（公网访问，用于测试）：

cat > /tmp/net-policy.json << 'EOF'
[{
  "Rules": [{"ResourceType": "collection", "Resource": ["collection/zstd-compression-test"]},
            {"ResourceType": "dashboard", "Resource": ["collection/zstd-compression-test"]}],
  "AllowFromPublic": true
}]
EOF

aws opensearchserverless create-security-policy \
  --name zstd-test-net \
  --type network \
  --policy file:///tmp/net-policy.json \
  --region us-east-1

创建数据访问策略：

cat > /tmp/access-policy.json << 'EOF'
[{
  "Rules": [
    {"ResourceType": "index", "Resource": ["index/zstd-compression-test/*"],
     "Permission": ["aoss:CreateIndex", "aoss:DeleteIndex", "aoss:UpdateIndex",
                     "aoss:DescribeIndex", "aoss:ReadDocument", "aoss:WriteDocument"]},
    {"ResourceType": "collection", "Resource": ["collection/zstd-compression-test"],
     "Permission": ["aoss:CreateCollectionItems", "aoss:DescribeCollectionItems",
                     "aoss:UpdateCollectionItems"]}
  ],
  "Principal": ["arn:aws:iam::595842667825:user/awswhatsnewtest"]
}]
EOF

aws opensearchserverless create-access-policy \
  --name zstd-test-access \
  --type data \
  --policy file:///tmp/access-policy.json \
  --region us-east-1

创建 Collection（关闭冗余节省成本）：

aws opensearchserverless create-collection \
  --name zstd-compression-test \
  --type TIMESERIES \
  --standby-replicas DISABLED \
  --description "Zstd compression codec testing" \
  --region us-east-1

等待 Collection 变为 ACTIVE（约 1-2 分钟）：

aws opensearchserverless batch-get-collection \
  --names zstd-compression-test \
  --region us-east-1 \
  --query 'collectionDetails[0].[status,collectionEndpoint]' \
  --output text

实测输出：

ACTIVE  https://y9tjrj5s9ku5izrgjrn9.us-east-1.aoss.amazonaws.com

Step 2: 创建不同 Codec 的索引¶

在同一个 Collection 中创建 5 个索引，使用不同的压缩配置：

ENDPOINT="https://y9tjrj5s9ku5izrgjrn9.us-east-1.aoss.amazonaws.com"

# 索引 1: LZ4 (默认)
awscurl --service aoss --region us-east-1 -X PUT "$ENDPOINT/log-lz4" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {"properties": {
      "@timestamp": {"type": "date"}, "level": {"type": "keyword"},
      "message": {"type": "text"}, "source_ip": {"type": "ip"},
      "user_agent": {"type": "text"}, "response_code": {"type": "integer"},
      "bytes_sent": {"type": "long"}, "request_path": {"type": "keyword"},
      "host": {"type": "keyword"}, "region": {"type": "keyword"}
    }}
  }'

# 索引 2: zstd Level 1
awscurl --service aoss --region us-east-1 -X PUT "$ENDPOINT/log-zstd-l1" \
  -H "Content-Type: application/json" \
  -d '{
    "settings": {"number_of_shards": 1, "number_of_replicas": 0,
                 "index.codec": "zstd", "index.codec.compression_level": 1},
    "mappings": {"properties": {
      "@timestamp": {"type": "date"}, "level": {"type": "keyword"},
      "message": {"type": "text"}, "source_ip": {"type": "ip"},
      "user_agent": {"type": "text"}, "response_code": {"type": "integer"},
      "bytes_sent": {"type": "long"}, "request_path": {"type": "keyword"},
      "host": {"type": "keyword"}, "region": {"type": "keyword"}
    }}
  }'

# 索引 3: zstd Level 3 (默认级别)
# 同上，将 compression_level 改为 3

# 索引 4: zstd Level 6 (最高)
# 同上，将 compression_level 改为 6

# 索引 5: zstd_no_dict Level 3
# 同上，将 index.codec 改为 "zstd_no_dict"

Step 3: 生成并加载测试数据¶

生成 50,000 条模拟日志数据（确定性随机，seed=42），确保每个索引接收完全相同的数据：

import json, random

random.seed(42)
NUM_DOCS = 50000

# 生成 NDJSON 格式的 bulk 数据
with open("/tmp/bulk_data.ndjson", "w") as f:
    for i in range(NUM_DOCS):
        action = json.dumps({"create": {}})
        doc = json.dumps({
            "@timestamp": 1775800000000 + i * 1000 + random.randint(0, 999),
            "level": random.choice(["INFO", "WARN", "ERROR", "DEBUG", "TRACE"]),
            "message": f"Request processed in {random.randint(1, 5000)}ms",
            "source_ip": f"{random.randint(1,223)}.{random.randint(0,255)}.{random.randint(0,255)}.{random.randint(1,254)}",
            "user_agent": random.choice(["Chrome/120.0", "Safari/605.1", "curl/8.4.0"]),
            "response_code": random.choice([200, 200, 200, 201, 301, 400, 404, 500]),
            "bytes_sent": random.randint(100, 500000),
            "request_path": random.choice(["/api/v1/users", "/api/v1/orders", "/health"]),
            "host": random.choice(["web-01.prod", "api-01.prod"]),
            "region": random.choice(["us-east-1", "eu-west-1"])
        })
        f.write(action + "\n" + doc + "\n")

分批加载（500 docs/batch）到每个索引：

# 对每个索引执行
for INDEX in log-lz4 log-zstd-l1 log-zstd-l3 log-zstd-l6 log-zstd-nodict-l3; do
  for BATCH in /tmp/batch_*.ndjson; do
    awscurl --service aoss --region us-east-1 \
      -X POST "$ENDPOINT/$INDEX/_bulk" \
      -H "Content-Type: application/x-ndjson" \
      -d @$BATCH
  done
done

加载完成后 force merge 到单个 segment，确保公平比较：

for INDEX in log-lz4 log-zstd-l1 log-zstd-l3 log-zstd-l6 log-zstd-nodict-l3; do
  awscurl --service aoss --region us-east-1 \
    -X POST "$ENDPOINT/$INDEX/_forcemerge?max_num_segments=1"
done

Step 4: 压缩率对比（核心实验）¶

查看 force merge 后的索引大小：

awscurl --service aoss --region us-east-1 \
  "$ENDPOINT/_cat/indices?v&s=index&h=index,docs.count,store.size"

实测输出：

index              docs.count store.size
log-lz4                 50000      7.5mb
log-zstd-l1             50000      5.8mb
log-zstd-l3             50000      5.6mb
log-zstd-l6             50000      5.3mb
log-zstd-nodict-l3      50000      5.8mb

Step 5: 查询延迟对比¶

对每个索引执行三种查询各 10 次，记录 took 时间（ms）：

# match_all 查询
awscurl --service aoss --region us-east-1 \
  -X POST "$ENDPOINT/log-lz4/_search" \
  -H "Content-Type: application/json" \
  -d '{"size": 100, "query": {"match_all": {}}}'

# range 查询
awscurl --service aoss --region us-east-1 \
  -X POST "$ENDPOINT/log-lz4/_search" \
  -H "Content-Type: application/json" \
  -d '{"size": 100, "query": {"range": {"@timestamp": {"gte": 1775810000000, "lte": 1775820000000}}}}'

# 聚合查询
awscurl --service aoss --region us-east-1 \
  -X POST "$ENDPOINT/log-lz4/_search" \
  -H "Content-Type: application/json" \
  -d '{"size": 0, "aggs": {"by_level": {"terms": {"field": "level"}}, "avg_bytes": {"avg": {"field": "bytes_sent"}}}}'

Step 6: 边界测试¶

无效压缩级别：

# 级别 0 — 低于最小值
awscurl --service aoss --region us-east-1 -X PUT "$ENDPOINT/test-invalid" \
  -H "Content-Type: application/json" \
  -d '{"settings": {"index.codec": "zstd", "index.codec.compression_level": 0}}'

实测输出：

{
  "error": {
    "type": "illegal_argument_exception",
    "reason": "Failed to parse value [0] for setting [index.codec.compression_level] must be >= 1"
  },
  "status": 400
}

# 级别 7 — 超过最大值
awscurl --service aoss --region us-east-1 -X PUT "$ENDPOINT/test-invalid" \
  -H "Content-Type: application/json" \
  -d '{"settings": {"index.codec": "zstd", "index.codec.compression_level": 7}}'

实测输出：

{"error": {"reason": "Failed to parse value [7] for setting [index.codec.compression_level] must be <= 6"}, "status": 400}

# 对 LZ4 设置 compression_level
awscurl --service aoss --region us-east-1 -X PUT "$ENDPOINT/test-invalid" \
  -H "Content-Type: application/json" \
  -d '{"settings": {"index.codec.compression_level": 3}}'

实测输出：

{"error": {"reason": "missing required setting [index.codec] for setting [index.codec.compression_level]"}, "status": 400}

Codec 不可变性验证：

# 尝试修改已有索引的 codec
awscurl --service aoss --region us-east-1 -X PUT "$ENDPOINT/log-lz4/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index.codec": "zstd"}'

实测输出：

{"error": {"reason": "Can't update non dynamic settings [[index.codec]] for indices [log-lz4]"}, "status": 400}

测试结果¶

压缩率对比¶

索引	Codec	级别	大小	vs LZ4	官方参考值
log-lz4	LZ4 (默认)	—	7.5 MB	基线	基线
log-zstd-l1	zstd	1	5.8 MB	-22.7%	-28.1%
log-zstd-l3	zstd	3	5.6 MB	-25.3%	—
log-zstd-l6	zstd	6	5.3 MB	-29.3%	-32%
log-zstd-nodict-l3	zstd_no_dict	3	5.8 MB	-22.7%	—

实测 vs 官方数据

我们的实测压缩率（22-29%）略低于官方 benchmark（26-32%），这是因为官方使用的是 nyc_taxi 数据集（结构化出租车记录，重复模式更多），而我们使用的是模拟日志数据（随机 IP、随机消息，压缩空间相对较小）。在真实日志场景中，压缩率预计在两者之间。

查询延迟对比（10 次取后 8 次中位数，ms）¶

索引	match_all	range	aggregation
log-lz4	16	15	14
log-zstd-l1	14	16.5	9
log-zstd-l3	12	13	8.5
log-zstd-l6	10.5	13	8
log-zstd-nodict-l3	11	12	8.5

关键发现：查询延迟没有变差，反而略有改善

zstd 索引的查询延迟全面优于或持平 LZ4。这与直觉相反——压缩更紧凑的数据意味着更少的 I/O，在 Serverless 的 S3 存储架构下，减少数据读取量 > 解压开销。

写入吞吐对比¶

索引	5000 docs bulk 平均耗时	vs LZ4
log-lz4	18,801 ms	基线
log-zstd-l1	~18,800 ms	≈0%
log-zstd-l3	18,753 ms	≈0%
log-zstd-l6	18,880 ms	≈0%
log-zstd-nodict-l3	18,686 ms	≈0%

Serverless 架构下写入吞吐差异被掩盖

与管理域 benchmark（L6 吞吐降 24%）不同，Serverless 架构下所有 codec 的写入吞吐几乎一致。这是因为 Serverless 的 OCU 自动扩展补偿了压缩计算开销，加上网络延迟和 SigV4 签名开销是主要瓶颈。

边界测试汇总¶

#	测试场景	结果	错误信息
1	compression_level=0	✅ 预期 400	`must be >= 1`
2	compression_level=7	✅ 预期 400	`must be <= 6`
3	LZ4 + compression_level	✅ 预期 400	`missing required setting [index.codec]`
4	修改已有索引 codec	✅ 预期 400	`Can't update non dynamic settings`
5	修改已有索引 compression_level	✅ 预期 400	`Can't update non dynamic settings`

踩坑记录¶

踩坑 1: index.codec 创建后完全不可变

在管理域（managed domain）中，可以通过 close → 修改 settings → reopen 来更改 codec。但 Serverless 不支持 close/reopen 操作，因此 codec 一旦选定就不可更改。如果选错了 codec，唯一的解决方案是创建新索引并 reindex。

生产影响：如果你有大量数据在 LZ4 索引中想迁移到 zstd，需要创建新索引 + reindex，这意味着双倍 OCU 消耗和数据传输成本。建议在创建索引前就确定好 codec 策略。

踩坑 2: compression_level 不能单独使用

设置 index.codec.compression_level: 3 而不设置 index.codec 会报错 missing required setting [index.codec]。这在官方文档中没有明确说明，容易让人误以为可以在默认 LZ4 上调压缩级别。

实测发现，官方未记录

发现：zstd_no_dict L1 Range Query 延迟异常未复现

官方 benchmark 显示 zstd_no_dict L1 的 Range Query p90 延迟恶化了 282.9%，但在我们的 Serverless 实测中并未复现此异常。可能原因：(1) 官方测试基于管理域而非 Serverless，(2) 数据集不同，(3) Serverless 的 OCU 架构行为不同。

费用明细¶

资源	单价	用量	费用
Serverless OCU (无冗余，1 OCU)	$0.24/OCU-hr	~2 hr	~$0.48
S3 存储	$0.024/GB-month	~50 MB	< $0.01
额外 OCU (auto-scaling)	$0.24/OCU-hr	~1 hr	~$0.24
合计			~$0.72

清理资源¶

ENDPOINT="https://y9tjrj5s9ku5izrgjrn9.us-east-1.aoss.amazonaws.com"

# 1. 删除所有索引
for INDEX in log-lz4 log-zstd-l1 log-zstd-l3 log-zstd-l6 log-zstd-nodict-l3; do
  awscurl --service aoss --region us-east-1 \
    -X DELETE "$ENDPOINT/$INDEX"
done

# 2. 删除 Collection
aws opensearchserverless delete-collection \
  --id y9tjrj5s9ku5izrgjrn9 \
  --region us-east-1

# 3. 删除策略
aws opensearchserverless delete-security-policy \
  --name zstd-test-enc --type encryption --region us-east-1
aws opensearchserverless delete-security-policy \
  --name zstd-test-net --type network --region us-east-1
aws opensearchserverless delete-access-policy \
  --name zstd-test-access --type data --region us-east-1

务必清理

OpenSearch Serverless 即使无流量也会持续计费（最低 1 OCU = $0.24/hr ≈ $5.76/天）。Lab 完成后请立即删除 Collection。

结论与建议¶

Codec 选择指南¶

场景	推荐 Codec	理由
低延迟优先（实时搜索）	LZ4 (默认)	解压最快，延迟最低
存储成本优先（大规模日志）	zstd L3	压缩率 -25% 起步，查询无 penalty
极致压缩（归档型日志）	zstd L6	压缩率 -29%，查询反而更快
平衡方案	zstd_no_dict L3	压缩率 -23%，无字典开销

核心发现¶

zstd 是"免费午餐" — 在 Serverless 上，zstd 相比 LZ4 不仅压缩率提升 22-29%，查询延迟反而略有改善，写入吞吐几乎无差异。对于新索引，没有理由不用 zstd
L3 是最优平衡点 — L1→L3 压缩率提升 2.6 个百分点，L3→L6 只多 4 个百分点。边际效用递减，L3 是性价比最高的选择
zstd vs zstd_no_dict 差异不大 — 字典压缩多提供约 2.6% 的额外压缩，但日志场景下差异不大。对于担心字典开销的用户，zstd_no_dict 是安全选择
提前规划 codec — 创建后不可更改，生产环境建议在索引模板中统一设置

Amazon OpenSearch Serverless Zstd 压缩实测：LZ4 vs Zstandard 索引压缩全面对比¶

背景¶

前置条件¶

核心概念¶

三种编解码器对比¶

关键限制¶

动手实践¶

Step 1: 创建 Serverless Collection¶

Step 2: 创建不同 Codec 的索引¶

Step 3: 生成并加载测试数据¶

Step 4: 压缩率对比（核心实验）¶

Step 5: 查询延迟对比¶

Step 6: 边界测试¶

测试结果¶

压缩率对比¶

查询延迟对比（10 次取后 8 次中位数，ms）¶

写入吞吐对比¶

边界测试汇总¶

踩坑记录¶

费用明细¶

清理资源¶

结论与建议¶

Codec 选择指南¶

核心发现¶

参考链接¶