At Fiddler AI, our mission is to provide a unified AI observability and security platform for agents, GenAI, and predictive ML. This means giving our customers deep insights into how their AI systems are behaving and performing in production. To power this, we analyze model inference logs – or "events" – which requires a robust, performant data store. For us, that's ClickHouse.
ClickHouse: The Analytical Powerhouse
ClickHouse is an open-source, column-oriented database management system renowned for its blazing-fast analytical query performance. Designed for real-time reporting and large-scale data ingestion, its columnar storage significantly accelerates queries that involve aggregations and filtering, making it an ideal choice for an AI observability and security platform like Fiddler. For those keen to dive deeper, the official ClickHouse documentation is an excellent resource: https://clickhouse.com/docs/
The Inevitable Duplicates: Why They Appear and Their Impact
In the world of distributed systems and high-throughput data ingestion, duplicates are an almost unavoidable reality. While ideally, every piece of data would arrive exactly once, several factors can conspire to create duplicate entries in your database:
- Network Retries and Transient Failures: When a client (like Fiddler's ingestion service) attempts to write data, the operation might fail temporarily. If the original operation did succeed on the server but the acknowledgment got lost, a retry will result in a duplicate.
- Application-Level Retries/Idempotency Challenges: Even without network issues, the application logic itself might trigger retries, re-attempting a batch of events where some might have already been processed.
- Source System Behavior: External systems sending data to Fiddler might, by design or error, send the same event multiple times.
The presence of duplicate data silently undermines the reliability and trustworthiness of an AI observability platform. The consequences are far-reaching and can lead to incorrect insights and misguided decisions:
- Inaccurate Metrics and Reporting: This is perhaps the most direct and damaging impact. Duplicates inflate metrics such as traffic volume and distort histograms and data integrity checks. Consequently, any reporting derived from these metrics will be skewed over time.
- Misleading Trends: If duplicates are not consistent over time, they can create artificial spikes or dips in trends, making it difficult to discern true changes in application or model behavior.
- False Positive/Negative Alerts: Inflated error rates or artificially high drift scores due to duplicates can trigger unnecessary alerts, leading to alert fatigue. Conversely, genuine issues might be masked.
- Increased Storage and Processing Costs: Duplicates consume valuable disk space in ClickHouse. More importantly, they lead to more data being processed during read queries, impacting query performance, increasing resource consumption, and driving up infrastructure costs.
- Erosion of Trust: Ultimately, if the data presented to users is known to contain inaccuracies, it erodes trust in the platform.
Fiddler's Multi-Layered Deduplication Strategy: From API to Database
Recognizing the critical importance of accurate data, Fiddler employs a sophisticated, multi-layered approach to deduplication, tackling potential duplicates at several stages of the data ingestion pipeline. Addressing deduplication early in the pipeline is more cost-effective than managing it once data has reached ClickHouse.

1. API-Level Deduplication with Redis
- When it Happens: This is the first line of defense against duplicates, occurring as soon as event data hits Fiddler's ingestion APIs.
- How it Works: For every incoming event, Fiddler calculates a unique hash of the event's data. This hash is then stored in a Redis cache for a short, effective duration (e.g., 30 minutes). If an event with an identical hash arrives within this timeframe, it is immediately identified as a duplicate and dropped at the API gateway level, preventing it from entering the rest of the ingestion pipeline (a minimal sketch of this pattern follows this list).
- Why it Matters: This provides near real-time deduplication for rapidly arriving duplicates, minimizing the load on downstream systems. It's highly effective and relatively cheap for bursty duplicate traffic.
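The sketch below shows this pattern in Python with the redis-py client. The hashing scheme, key naming, and 30-minute TTL are illustrative assumptions, not Fiddler's exact implementation.

```python
import hashlib
import json

import redis

# Illustrative TTL matching the ~30-minute window described above.
DEDUP_TTL_SECONDS = 30 * 60

r = redis.Redis(host="localhost", port=6379)

def is_duplicate_event(event: dict) -> bool:
    """Return True if an identical event was already seen within the TTL window."""
    # Canonicalize the payload so logically identical events hash identically.
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    # SET with NX succeeds only if the key does not exist yet, so a falsy
    # result means the same hash was stored recently, i.e. a duplicate.
    first_seen = r.set(f"event-hash:{digest}", 1, nx=True, ex=DEDUP_TTL_SECONDS)
    return not first_seen

# Usage: drop duplicates at the API layer, before they reach downstream systems.
event = {"event_id": "abc-123", "model_id": "fraud-v2", "output": 0.87}
if not is_duplicate_event(event):
    print("forward event to the ingestion pipeline")
```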
2. Application-Level Micro-Batch Status Tracking (PostgreSQL)
- When it Happens: After passing the API deduplication, events are batched and prepared for persistent storage. This layer ensures that logical batches are processed only once.
- How it Works: Fiddler maintains a separate record in a PostgreSQL database for the status of each ingested micro-batch (chunk) of events. Before attempting to write a chunk to ClickHouse, the ingestion service first checks its status in PostgreSQL. If that specific chunk has already been marked as successfully processed, the ingestion process for that chunk is skipped entirely (see the sketch after this list).
- Why it Matters: This prevents redundant work and avoids inserting data that has already been confirmed as ingested, even if higher-level application logic triggers a retry for a larger processing unit.
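Conceptually, the check looks something like the following Python/psycopg2 sketch. The chunk_status table, its columns, and the 'SUCCESS' status value are hypothetical stand-ins for Fiddler's internal schema.

```python
import psycopg2

# Connection details are illustrative.
conn = psycopg2.connect("dbname=ingestion user=ingest password=secret host=localhost")

def chunk_already_processed(chunk_id: str) -> bool:
    """Check the hypothetical chunk_status table before writing to ClickHouse."""
    with conn.cursor() as cur:
        cur.execute("SELECT status FROM chunk_status WHERE chunk_id = %s", (chunk_id,))
        row = cur.fetchone()
    return row is not None and row[0] == "SUCCESS"

def mark_chunk_processed(chunk_id: str) -> None:
    """Record success only after the ClickHouse insert is acknowledged."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chunk_status (chunk_id, status)
            VALUES (%s, 'SUCCESS')
            ON CONFLICT (chunk_id) DO UPDATE SET status = 'SUCCESS'
            """,
            (chunk_id,),
        )
    conn.commit()

def ingest_chunk(chunk_id: str, rows: list) -> None:
    if chunk_already_processed(chunk_id):
        return  # Skip: this micro-batch was already written to ClickHouse.
    # ... perform the actual ClickHouse insert here (see layer 3 below) ...
    mark_chunk_processed(chunk_id)
```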
3. Idempotent Inserts with insert_deduplication_token (ClickHouse)
- When it Happens: As Fiddler's ingestion services perform the actual write operations into ClickHouse.
- How it Works: When Fiddler's ingestion service sends a batch of events to ClickHouse, it attaches a unique insert_deduplication_token to that specific batch. If the network drops or ClickHouse is momentarily unavailable and the client retries the insert with the same token, ClickHouse recognizes it. It checks its internal state and, if that token has already been successfully processed, it simply acknowledges the "new" insert without actually writing the data again (a sketch follows this list).
- A Note on Deduplication Window: For replicated tables, ClickHouse's block-level deduplication (which insert_deduplication_token leverages) is influenced by settings like replicated_deduplication_window (number of blocks) and replicated_deduplication_window_seconds (time window). These settings determine how long ClickHouse keeps track of inserted block hashes for deduplication purposes, ensuring that duplicates are detected within a configurable timeframe. While a larger window could identify more duplicate blocks, managing block ID hashes in Keeper/ClickHouse incurs a cost: larger windows create larger hash files. Small values like 100 or 1000 are often sufficient, especially since the window typically only needs to cover the short period (seconds to minutes) between an initial insert and its retry. The window should be sized based on the insert rate; for example, 1000 inserts per minute per table is already a very high ingest rate.
- Why it Matters: This is crucial for handling duplicates arising from retries of the same insert operation due to transient network failures or timeouts. It guarantees that a logical batch is written "at most once" at the ClickHouse interface. You can learn more about this powerful setting in the ClickHouse documentation: https://clickhouse.com/docs/operations/settings/settings#insert_deduplicate (Note: While insert_deduplication_token is specifically for client-provided tokens, the linked insert_deduplicate setting provides context on ClickHouse's built-in deduplication mechanisms for replicated tables, which is related to the token's purpose).
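As a rough sketch, assuming the clickhouse_connect Python client and a hypothetical inference_events table (Fiddler's actual ingestion code and schema may differ), the token is passed as a per-insert setting:

```python
from datetime import datetime

import clickhouse_connect

# Connection details are illustrative.
client = clickhouse_connect.get_client(host="localhost")

def insert_chunk_with_token(chunk_id: str, rows: list) -> None:
    """Insert a micro-batch with a deterministic deduplication token.

    Deriving the token from the chunk's identity (not from random data) means
    a retry of the same chunk carries the same token, so ClickHouse can skip
    the write if the original insert already succeeded.
    """
    token = f"chunk-{chunk_id}"
    client.insert(
        "inference_events",  # hypothetical table name
        rows,
        column_names=["event_id", "model_id", "event_ts", "payload"],
        settings={"insert_deduplication_token": token},
    )

# A retry with the same chunk_id sends the same token, so ClickHouse
# deduplicates the block instead of writing it twice.
insert_chunk_with_token(
    "chunk-0001",
    [["evt-1", "fraud-v2", datetime(2024, 1, 1, 12, 0, 0), "{}"]],
)
```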
4. Eventually Consistent Deduplication with ReplacingMergeTree (ClickHouse)
- When it Happens: This is the final, continuous layer of deduplication, handled directly by the ClickHouse database engine in the background.
- How it Works:
  - The Engine: Fiddler configures its ClickHouse tables that store inference events to use the ReplacingMergeTree engine. Crucially, we define the ORDER BY clause (which forms the primary key for merges) to uniquely identify an event (e.g., a combination of event_id, model_id, and a timestamp). We also specify a version column (a timestamp with millisecond precision, defaulting to NOW64()) to keep the "latest" version of the event (an illustrative table definition follows this section).
  - The Magic: ClickHouse MergeTree tables are designed to periodically merge smaller data parts into larger ones in the background. When ReplacingMergeTree performs these merges, it identifies rows that have identical values in their ORDER BY columns. For each set of duplicates, it retains only one row – the one with the maximum value (the latest timestamp, based on NOW64()) in the specified version column.
- Why it Matters: This tackles broader duplicate scenarios, including those originating from external systems, re-processing of historical data, or cases where the same logical event might arrive at different times or via different paths. While duplicates might temporarily exist in different data parts after ingestion, they are eventually and automatically de-duplicated by ClickHouse itself. This provides an out-of-the-box, highly efficient, and transparent mechanism for ensuring the eventual uniqueness of event data without manual intervention. For more details, refer to the ClickHouse documentation on ReplacingMergeTree: https://clickhouse.com/docs/engines/table-engines/mergetree-family/replacingmergetree
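To make this concrete, a much-simplified version of such a table (the column set, names, and settings here are illustrative, not Fiddler's production schema) could be created like this, again via the clickhouse_connect client:

```python
import clickhouse_connect

# Connection details are illustrative.
client = clickhouse_connect.get_client(host="localhost")

# ORDER BY uniquely identifies an event; inserted_at is the version column
# (defaulting to now64()), so the most recently inserted copy of a duplicate
# survives the background merges.
client.command(
    """
    CREATE TABLE IF NOT EXISTS inference_events
    (
        model_id    String,
        event_id    String,
        event_ts    DateTime64(3),
        payload     String,
        inserted_at DateTime64(3) DEFAULT now64()
    )
    ENGINE = ReplacingMergeTree(inserted_at)
    ORDER BY (model_id, event_id, event_ts)
    -- Enables block-level deduplication (and insert_deduplication_token) on a
    -- non-replicated table; replicated tables use replicated_deduplication_window
    -- and replicated_deduplication_window_seconds instead.
    SETTINGS non_replicated_deduplication_window = 1000
    """
)
```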
The Evolution of Deduplication: From MergeTree to ReplacingMergeTree
Earlier in our journey, Fiddler utilized the basic MergeTree engine. While MergeTree is excellent for high-volume append-only data, deduplicating events with it presented significant challenges:
- Manual Deduplication Post-Ingestion: With MergeTree, any deduplication would have required running periodic DELETE queries, which are resource-intensive and introduce complexities in managing data consistency. This also meant that queries executed before such a deduplication run would return potentially inaccurate, duplicated results.
- Performance Overhead: Performing deduplication on the fly on large tables at query time significantly impacts query performance, directly affecting the responsiveness of our platform.
- Complexity in Data Pipelines: Maintaining custom logic for deduplication within our data pipelines added overhead and complexity, increasing the potential for errors and requiring more development effort.
The transition to ReplacingMergeTree was a game-changer. It allowed us to offload the deduplication responsibility directly to the database engine. By simply defining a version column and ensuring our ORDER BY clause uniquely identifies an event, ReplacingMergeTree handles the cleanup automatically during background merges. This means:
- Simplified Data Pipelines: Less custom deduplication logic is needed in our ingestion pipeline.
- Optimized Query Performance (with eventual consistency): While the FINAL modifier can be used with ReplacingMergeTree to force immediate deduplication at query time (at the cost of performance), the engine's background merges eventually eliminate duplicates. For most analytical queries where real-time, absolute consistency isn't strictly required, the data is already clean, leading to better query performance. The performance penalty of using FINAL also diminishes over time as background merges reduce the number of duplicates (see the example after this list).
- Automated Maintenance: ClickHouse takes care of the merging and deduplication in the background, requiring less manual intervention and operational overhead from our engineering team. This also allows our engineers to concentrate on areas crucial to our product and value proposition.
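For reads that do need fully deduplicated results before the background merges catch up, FINAL can be applied at query time. A brief sketch, reusing the illustrative inference_events table and clickhouse_connect client from above:

```python
import clickhouse_connect

# Connection details are illustrative.
client = clickhouse_connect.get_client(host="localhost")

# FINAL forces ReplacingMergeTree to collapse duplicates at read time,
# trading some query performance for fully deduplicated results.
result = client.query(
    """
    SELECT model_id, count() AS event_count
    FROM inference_events FINAL
    WHERE event_ts >= now() - INTERVAL 1 DAY
    GROUP BY model_id
    """
)
for model_id, event_count in result.result_rows:
    print(model_id, event_count)
```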
By strategically combining API-level deduplication, application-level tracking, ClickHouse's insert_deduplication_token, and the powerful ReplacingMergeTree engine, Fiddler AI has built a highly scalable and reliable system for ingesting and deduplicating customer model inference events. This ensures the accuracy and integrity of the insights we deliver, empowering our users to make confident decisions about their AI models and applications.
Love diving deep into technical challenges? Come work with engineers who are passionate about AI, data infrastructure, and distributed systems. Check out our careers page.