Elasticsearch Pitfalls Guide: 14 Practical Lessons from My Projects

Introduction

> ❝

> When I first started using Elasticsearch (ES), it felt like a black box — throw data in, run a query, and somehow results just appeared.

> Taking ownership of my company’s core search module revealed that this “black box” hides countless details that demand attention.

> ❞

In this article, I’m sharing practical Elasticsearch lessons learned from real-world projects, covering:

  • Index Design
  • Field Types
  • Query Optimization
  • Cluster Management
  • Architecture Design

---

Index Design: From Basics to Advanced

1. Index Alias — Your Escape Route for Changes

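An early mistake was hard-coding index names in application code. When I later needed to change a field type, I learned that ES does not let you change the type of an existing field or the primary shard count of an existing index in place; you have to build a new index and reindex the data into it.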

(Adding fields is fine.)

Solution:

  • Use index aliases in all business logic.
  • When rebuilding an index, simply point the alias to the new index.
  • Fully transparent to users.

Think of it as giving the index a nickname — you can swap internals freely without breaking external references.
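The alias switch can be sketched as a single request body. This is a minimal sketch, assuming the body is sent to the `_aliases` endpoint with whatever client you use; the names `orders`, `orders_v1`, and `orders_v2` are hypothetical.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build a body for POST /_aliases that repoints `alias` in one request."""
    return {
        "actions": [
            # Both actions travel in the same request, so readers of the
            # alias never see a moment where it points at nothing.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: business code only ever talks to "orders".
swap_body = alias_swap_actions("orders", "orders_v1", "orders_v2")
print(swap_body)
```

Because the application only knows the alias, the rebuilt index can be swapped in without a code change or deployment.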

---

2. Routing — Make Queries Precise

In a SaaS e-commerce system, vendor-specific order queries were slow because ES, by default, picks the shard for each document by hashing its document ID, which scatters a single vendor's data across all shards.

Optimization:

  • Use vendor ID as the routing key when inserting/querying.
  • All of a vendor’s data resides on one shard.

Impact:

  • Before: Queries scanned all shards (e.g., 3 shards).
  • After: Only 1 shard scanned.
  • Result: Speed doubled, resource use decreased.
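The effect of the routing key can be illustrated with a simplified model of shard selection. This is only a sketch: it assumes the simplified rule "shard = hash(routing value) % primary shard count", and `zlib.crc32` is a stand-in for the hash ES actually uses. The order and vendor IDs are hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # matches the 3-shard example above


def shard_for(routing_value: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Simplified model of shard selection (crc32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


order_ids = [f"order-{i}" for i in range(100)]

# Default behaviour: each document is placed by its own ID.
shards_by_id = {shard_for(order_id) for order_id in order_ids}

# Custom routing: every order of this vendor is placed by the vendor ID.
shards_by_vendor = {shard_for("vendor-42") for _ in order_ids}

print("shards touched with ID routing:    ", sorted(shards_by_id))
print("shards touched with vendor routing:", sorted(shards_by_vendor))
```

In real requests the same value has to be supplied as the `routing` parameter both when indexing and when searching; otherwise the search still fans out to every shard.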

---

3. Shard Splitting — Handle Growing Data

Increasing shard count alone is not the best long-term strategy.

Best practices:

  • Business indexes: ≤ 30 GB per shard (aim for 10–30 GB)
  • Search indexes: ≤ 10 GB per shard
  • Log indexes: 20–50 GB per shard

In SaaS, large “super vendors” may cause skew.

Strategy: Split the data across 64 indexes by `merchantID % 64` (buckets 0 to 63, named `orders_001` through `orders_064`), and use the vendor ID as the routing key within each index.
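A minimal sketch of that mapping follows. It assumes bucket 0 maps to `orders_001` (that is, the bucket number is shifted by one to get the 1-based name), which the naming range above implies but does not state; the function names are hypothetical.

```python
NUM_INDEXES = 64


def index_for_merchant(merchant_id: int) -> str:
    """Map a merchant to one of orders_001 .. orders_064."""
    bucket = merchant_id % NUM_INDEXES      # 0..63
    return f"orders_{bucket + 1:03d}"       # assumed shift to 1-based names


def routing_for_merchant(merchant_id: int) -> str:
    """Routing key used inside the chosen index (the vendor ID itself)."""
    return str(merchant_id)


for merchant_id in (0, 63, 64, 1001):
    print(merchant_id, index_for_merchant(merchant_id), routing_for_merchant(merchant_id))
```

Whatever rule is chosen, it has to stay stable: changing the modulus later moves merchants to different indexes and forces a data migration.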

> ❝

> Choose splitting and routing rules based on data volumes and business needs.

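> Avoid huge shard counts per ES node: since ES 7.0, `cluster.max_shards_per_node` defaults to 1000 shards per data node.

> As a rule of thumb, keep the number of shards on a node at or below about 20 per GB of JVM heap (a heap-to-shard ratio of roughly 1:20).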

> ❞
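The sizing guidelines above reduce to simple arithmetic. The sketch below assumes the figures quoted in this section (a target shard size in GB, and about 20 shards per GB of heap); the helper names and the sample numbers are hypothetical.

```python
import math


def primary_shards_needed(index_size_gb: float, target_shard_gb: float) -> int:
    """Smallest primary shard count that keeps each shard at or under the target."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def shard_budget_for_heap(heap_gb: float, shards_per_heap_gb: int = 20) -> int:
    """Rough upper bound on shards per node from the 1:20 guideline."""
    return int(heap_gb * shards_per_heap_gb)


# A 240 GB business index with a 30 GB per-shard ceiling:
print(primary_shards_needed(240, 30))   # 8
# A node with a 16 GB heap:
print(shard_budget_for_heap(16))        # 320
```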

---

Field Types: Choose Wisely Early

4. Text vs Keyword — Core Difference

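Mistakenly storing phone numbers as text led to failed exact matches: a text field is run through an analyzer at index time, so a value such as `138-0013-8000` is stored as several separate terms, and a term-level query for the original string finds nothing.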

Rules:

  • Use text for tokenized full-text search (e.g., product descriptions)
  • Use keyword for exact matches (e.g., order IDs, phone numbers)
  • Keyword fields skip analysis, which usually means faster exact-match queries and a smaller index than an analyzed text field
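As a sketch of these rules, the mapping below declares one analyzed field and two exact-match fields, followed by a term query against the phone number. The field names are hypothetical; the dictionaries are the JSON bodies you would send when creating the index and searching.

```python
def customer_mapping() -> dict:
    """Index-creation body: full-text for descriptions, exact match for IDs."""
    return {
        "mappings": {
            "properties": {
                "description": {"type": "text"},   # tokenized, full-text search
                "order_id": {"type": "keyword"},   # stored as one exact term
                "phone": {"type": "keyword"},      # stored as one exact term
            }
        }
    }


def exact_phone_query(phone: str) -> dict:
    """Term query: matches only documents whose phone equals `phone` exactly."""
    return {"query": {"term": {"phone": phone}}}


print(customer_mapping())
print(exact_phone_query("13800138000"))
```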

---

5. Multi-fields — Use Smartly

With dynamic mapping, ES maps a new string field as `text` and adds a `keyword` sub-field to it; that sub-field is not always needed.

Selection criteria:

  • Keep the multi-field when you need both exact matching and full-text search on the same value.
  • Drop the `keyword` sub-field when only full-text search is required.
  • Dropping the unused sub-field saves storage and improves write performance.
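The choice can be made explicit in the mapping instead of being left to dynamic mapping. A minimal sketch, with hypothetical field names (`title`, `description`); the `ignore_above` value mirrors what dynamic mapping typically generates.

```python
def text_field(with_keyword: bool) -> dict:
    """Build a text field definition, optionally with a keyword sub-field."""
    field = {"type": "text"}
    if with_keyword:
        # Exposed to queries as "<field>.keyword" for exact matches.
        field["fields"] = {"keyword": {"type": "keyword", "ignore_above": 256}}
    return field


explicit_mapping = {
    "mappings": {
        "properties": {
            "title": text_field(with_keyword=True),         # exact + full-text
            "description": text_field(with_keyword=False),  # full-text only
        }
    }
}
print(explicit_mapping)
```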

---

6. Sorting Fields — Match the Data Type

Sorting numbers stored as `keyword` yields the wrong order, because the values are compared as strings, character by character (e.g., `100` sorts before `99`).

Fix:

  • Use long/integer for numeric sorting
  • Use date for time sorting
  • Result: Accurate ordering, better performance and memory use.
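The ordering problem is easy to reproduce outside ES, since it is plain string comparison. The sketch below sorts the same values as strings and as integers, then shows a mapping fragment with hypothetical field names (`amount`, `created_at`).

```python
values = ["99", "100", "9", "1000"]

# What a keyword field does: lexicographic comparison of strings.
as_keyword = sorted(values)
# What a long field does: numeric comparison.
as_long = sorted(int(v) for v in values)

print(as_keyword)  # ['100', '1000', '9', '99']
print(as_long)     # [9, 99, 100, 1000]

# Mapping fragment with sortable types (field names are hypothetical).
sortable_mapping = {
    "mappings": {
        "properties": {
            "amount": {"type": "long"},
            "created_at": {"type": "date"},
        }
    }
}
```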

---

Query Optimization: Speed Without Sacrifice

7. Fuzzy Queries — Use the Right Way

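Before ES 7.9: a `wildcard` query on a `keyword` field is pattern-matched against the terms in the index; a leading wildcard (`*xxx`) forces a scan of every term, which is resource-heavy.

ES 7.9 and later:

  • Use the `wildcard` field type for fields that need `LIKE '%xxx%'`-style matching.
  • Under the hood it indexes n-grams to narrow down candidates, then checks them against the full value kept in binary doc values.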

> Tip: For detailed pre/post-7.9 behavior, see "The 'Fuzzy' Duel with a Product Manager: Implementing MySQL LIKE '%xxx%' in ES".
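A minimal sketch of the 7.9+ approach: a mapping that declares a `wildcard` field and a query body that emulates `LIKE '%xxx%'`. The field name `product_code` is hypothetical, and the helper does not escape `*` or `?` inside the fragment, so those characters would be treated as wildcards.

```python
def wildcard_mapping(field: str) -> dict:
    """Index-creation body declaring `field` with the wildcard field type."""
    return {"mappings": {"properties": {field: {"type": "wildcard"}}}}


def contains_query(field: str, fragment: str) -> dict:
    """Query body matching documents whose `field` contains `fragment`."""
    return {"query": {"wildcard": {field: {"value": f"*{fragment}*"}}}}


print(wildcard_mapping("product_code"))
print(contains_query("product_code", "A17"))
```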

---

8. Pagination — Avoid Deep Pages

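Infinite scroll and "jump to page N" features can lead to deep pagination, which is expensive: every shard has to collect and sort `from + size` hits before the coordinating node merges them. Large search products commonly cap the number of pages for both UX and efficiency.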

If unavoidable:

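  • Shallow (`from/size`): good for early pages; by default ES rejects requests where `from + size` exceeds 10,000 (`index.max_result_window`).
  • Scroll: for large exports; costly, because the server keeps a search context (a frozen view of the index) open between requests.
  • search_after: efficient sequential paging that continues from the sort values of the last hit; cannot jump to an arbitrary page.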

Best: Design business logic to eliminate deep pagination altogether.
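For the `search_after` option, paging amounts to feeding the last hit's sort values into the next request. A minimal sketch, assuming a hypothetical `created_at` date field and a unique `order_id` keyword field as tiebreaker; the sample response is hand-written to show the shape, not real output.

```python
from typing import Optional


def page_body(size: int, last_sort: Optional[list] = None) -> dict:
    """Search body for one page; pass the previous page's cursor to continue."""
    body = {
        "size": size,
        # A unique tiebreaker keeps the order stable between requests.
        "sort": [{"created_at": "desc"}, {"order_id": "asc"}],
    }
    if last_sort is not None:
        body["search_after"] = last_sort
    return body


def next_cursor(response: dict) -> Optional[list]:
    """Sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None


# Hand-written response fragment, for illustration only.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "sort": [1700000000000, "o-1"]},
            {"_id": "2", "sort": [1699999999000, "o-2"]},
        ]
    }
}
cursor = next_cursor(sample_response)
print(page_body(20))          # first page
print(page_body(20, cursor))  # following page
```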

---

Cluster Management: Keep Operations Smooth

9. Index Lifecycle — Automate

Logs grow fast; unmanaged storage fills quickly.

Approach:

  • Daily-index naming (e.g., `log_YYYYMMDD`)
  • Retention rules (e.g., 7 or 30 days)
  • Use index templates so each new daily index picks up the same settings and mappings automatically.
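The naming and retention rules can be sketched in a few lines. This is a name-based sketch only: it computes which daily indexes fall outside the retention window and leaves the actual deletion to whatever job or built-in lifecycle feature you use. The helper names and sample dates are hypothetical.

```python
from datetime import date, datetime, timedelta

PREFIX = "log_"


def index_name_for(day: date) -> str:
    """Daily index name in the log_YYYYMMDD scheme."""
    return f"{PREFIX}{day:%Y%m%d}"


def expired_indexes(names: list, today: date, keep_days: int) -> list:
    """Names whose date is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not one of our daily indexes
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y%m%d").date()
        except ValueError:
            continue  # suffix is not a date; leave it alone
        if day < cutoff:
            expired.append(name)
    return expired


existing = ["log_20240101", "log_20240105", "log_20240110", "orders_v1"]
print(index_name_for(date(2024, 1, 10)))
print(expired_indexes(existing, today=date(2024, 1, 10), keep_days=7))
```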

---

10. Near Real-Time — Refresh Mechanism

Newly written data isn’t searchable immediately: ES makes it visible on refresh, which by default runs every 1s as a balance between freshness and indexing throughput.

Adjust:

  • High real-time needs → stay at 1s
  • Heavy write loads → lengthen the interval via `index.refresh_interval` (for example, 30s)
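Changing the interval is a one-field settings update. A minimal sketch of the body, assuming it is sent as an index settings update for the target index; the 30s value is only an example.

```python
def refresh_settings(interval: str) -> dict:
    """Settings body that sets the refresh interval for an index."""
    return {"index": {"refresh_interval": interval}}


write_heavy = refresh_settings("30s")  # fewer refreshes during heavy ingest
near_real_time = refresh_settings("1s")  # back to the default behaviour

print(write_heavy)
print(near_real_time)
```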

> ❝

> Tip: Where immediate data retrieval is critical —

> - Show submitted data directly on UI, re-query ES next time.

> - Delay UI query by ~1.5s after update.

> Collaboration between frontend and backend often solves this better than backend-only hacks.

> ❞

---

11. Memory Settings — Why 32 GB Limit Matters

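Reason: The JVM can use compressed object pointers only while the heap stays below roughly 32 GB. Above that threshold, pointers widen to 64 bits, so part of the extra heap is spent on the pointers themselves.

Recommendation: Give ES ~50% of node memory, without crossing the ~32 GB threshold, and leave the rest to the OS for the filesystem cache.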
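The recommendation can be written as a one-line rule. In this sketch the 31 GB cap is an assumption: it is a commonly used safety margin below the ~32 GB threshold, not a figure from this article, and the function name is hypothetical.

```python
def recommended_heap_gb(node_ram_gb: float, cap_gb: float = 31.0) -> float:
    """Half of the node's RAM, capped below the compressed-pointer threshold."""
    return min(node_ram_gb / 2, cap_gb)


for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {recommended_heap_gb(ram)} GB heap")
```

On large machines the cap dominates, which is why adding RAM beyond about 64 GB per node mostly benefits the filesystem cache rather than the heap.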

---

Architecture Design: Clear Role Division

12. ES vs Database — Each Does Its Job

Avoid treating ES as the store for full business data: keeping two complete copies in sync invites consistency issues.

Model:

  • ES stores searchable criteria + document IDs
  • DB stores full data
  • ES finds IDs → DB retrieves details
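The two-step flow can be sketched end to end with stand-ins. Here an in-memory SQLite table plays the role of the business database, and a hand-written dictionary plays the role of an ES search response (only the `_id` of each hit is used); table and column names are hypothetical.

```python
import sqlite3


def ids_from_search(response: dict) -> list:
    """Document IDs from a search response, in relevance order."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def fetch_orders(conn: sqlite3.Connection, ids: list) -> list:
    """Load full rows from the database, preserving the order ES returned."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, detail FROM orders WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IDs missing from the database (e.g. deleted rows) are skipped.
    return [by_id[i] for i in ids if i in by_id]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, detail TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a1", "first"), ("b2", "second"), ("c3", "third")],
)

# Stand-in for what ES would return for some query.
fake_response = {"hits": {"hits": [{"_id": "c3"}, {"_id": "a1"}, {"_id": "zz"}]}}
orders = fetch_orders(conn, ids_from_search(fake_response))
print(orders)
```

Skipping IDs that no longer exist in the database is one simple way to tolerate the lag between the two stores.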

---

13. Nested Objects — Preserve Relationships

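Arrays of objects (e.g., product specs) stored with the default object type are flattened: ES indexes each sub-field as one multi-value list and loses track of which values belonged to the same object.

Solution: Use the nested type, which indexes each array element as its own hidden document, so a query can require that all conditions match within the same element.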
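The flattening can be simulated in plain Python to show the false positive. This sketch models the behaviour rather than calling ES; the `specs` field and its `name`/`value` sub-fields are hypothetical, and the mapping at the end shows how the field would be declared as nested.

```python
doc = {
    "specs": [
        {"name": "color", "value": "red"},
        {"name": "size", "value": "XL"},
    ]
}


def flatten(specs: list) -> dict:
    """How the object type sees the array: one value list per sub-field."""
    return {
        "specs.name": [s["name"] for s in specs],
        "specs.value": [s["value"] for s in specs],
    }


def object_match(document: dict, name: str, value: str) -> bool:
    """name AND value against the flattened lists (pairing is lost)."""
    flat = flatten(document["specs"])
    return name in flat["specs.name"] and value in flat["specs.value"]


def nested_match(document: dict, name: str, value: str) -> bool:
    """name AND value must hold inside the same array element."""
    return any(s["name"] == name and s["value"] == value for s in document["specs"])


print(object_match(doc, "color", "XL"))  # True: a false positive
print(nested_match(doc, "color", "XL"))  # False

nested_mapping = {
    "mappings": {
        "properties": {
            "specs": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "value": {"type": "keyword"},
                },
            }
        }
    }
}
```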

---

14. Replica Settings — Balance Read & Write

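  • Most common: 1 replica
  • High query load: increase replicas, since searches can be served by any copy of a shard
  • Remember: more replicas → heavier write load, because every write is applied to each replica as well as the primary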
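The trade-off is visible in the arithmetic: each replica adds one more full copy of every primary shard. A minimal sketch, with a settings body for changing the replica count on an existing index; the helper names are hypothetical.

```python
def replica_settings(replicas: int) -> dict:
    """Settings body that changes the replica count of an existing index."""
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Shards the cluster must store and write to for one index."""
    return primaries * (1 + replicas)


print(replica_settings(2))
print(total_shard_copies(primaries=3, replicas=1))  # 6
print(total_shard_copies(primaries=3, replicas=2))  # 9
```

Unlike the primary shard count, the replica count can be changed on a live index, so it can follow query load.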

---

Final Words

Using ES is like maintaining a growing road network — initial setup is simple, but traffic growth requires optimized design.

Key takeaway:

> Understanding why something is designed a certain way lets you solve problems without memorizing commands.

Technology’s true value lies in elegantly solving real problems.

> ❝

> If asked "How to use ES effectively?" I’d say:

> First understand the business scenario, then pick the technical approach — like choosing the right fuzzy search method based on your ES version.

> ❞
