Teams evaluating enterprise data platforms are discovering a new way to tame car telemetry at scale, as Databricks and Mercedes‑Benz publish a hierarchical semantic data model that promises faster, cheaper petabyte-scale analytics. This guide explains what the model does, why RLE plus Liquid Clustering mattered in real Mercedes‑Benz tests, and how teams can pick layouts and optimisations that actually save time and money.
- Hierarchical approach: Multi-level metadata tables let you prune millions of sessions down to a handful before touching heavy time series, so queries feel faster and more focused.
- Space versus speed trade-off: Run‑length encoding (RLE) plus Liquid Clustering gave the best runtimes in Mercedes‑Benz benchmarks, but added roughly 15% storage versus Z‑Ordering (77 TB vs 67 TB); it still felt worth it.
- Real-world scale: Tests used 64.55 TB of raw MDF files from 21 vehicles covering 40,000 hours, so findings are grounded in production data, not synthetic samples.
- Practical win: For high-frequency signals with long constant stretches, RLE cut runtimes by up to ~70% with Z‑Ordering and still gave large savings with Liquid Clustering; results matter most when thousands of engineers need fast, repeatable insights.
- Tuning note: Smaller Delta target file sizes and zstd compression improve dynamic file pruning and keep selective queries fast.
Why this new semantic model reads a vehicle like a pulse
Databricks and Mercedes‑Benz framed vehicle telemetry as a hierarchical, semantic problem, not just a bulk storage one. The model stores time series samples in a narrow table with container_id and channel_id, while contextual metadata lives in a set of metric and tag tables. The result reads more like a patient record than a raw log: you can filter by vehicle, then session, then signal, narrowing the data you touch by orders of magnitude. That means less noise, faster queries, and engineers seeing meaningful patterns sooner, which feels reassuring when you’re diagnosing an intermittent lane‑change issue or tuning an electric‑range model.
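To make the shape concrete, here is a minimal sketch of what such a layout could look like in a Databricks notebook. Apart from container_id and channel_id (and the measurement_session_id/signal_id keys discussed in the benchmark section below), every table and column name is illustrative rather than the published schema.

```python
# Minimal sketch of the hierarchical layout, assuming a Databricks
# notebook where `spark` is predefined. Names beyond container_id,
# channel_id, measurement_session_id and signal_id are illustrative.
spark.sql("""
  CREATE TABLE IF NOT EXISTS telemetry.samples (
    container_id           BIGINT,     -- MDF container the sample came from
    channel_id             BIGINT,     -- channel within that container
    measurement_session_id BIGINT,     -- recording session
    signal_id              BIGINT,     -- semantic signal identifier
    ts                     TIMESTAMP,  -- sample timestamp
    value                  DOUBLE      -- sample value
  ) USING DELTA
""")

spark.sql("""
  CREATE TABLE IF NOT EXISTS telemetry.session_tags (
    measurement_session_id BIGINT,
    tag_key                STRING,     -- e.g. 'road_type'
    tag_value              STRING      -- e.g. 'highway'
  ) USING DELTA
""")
```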
Behind the scenes the team leaned on ASAM standards and Mercedes‑Benz production data. That imposed practical constraints (split containers for huge files, multiple ECUs streaming at 1–100 Hz) and produced a model validated on real development-vehicle workflows. Owners of telemetry platforms will recognise the emotional payoff: analysis that used to feel like combing a haystack now surfaces clear signals quickly.
How the three‑level metadata filtering changes query behaviour
The clever part is the three filtering levels: test object (vehicle), measurement session, and signal. Each level has metrics and tags to prune the search space. For example, to find sessions relevant to Automatic Lane Change readiness you first limit by date and vehicle attributes, then by sessions where Active Steering Assist was activated, then by road type and speed range. By the time you join the heavy signal table, only a tiny slice needs processing.
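As a hedged sketch of that three-step pruning for the ALC example, reusing the illustrative tables above (the sessions and vehicles tables, tag keys, values, and signal id below are all hypothetical):

```python
# Levels 1 and 2: prune by vehicle attributes, date, and session tags
# before going anywhere near the time series. All names are hypothetical.
candidates = spark.sql("""
  SELECT s.measurement_session_id
  FROM telemetry.sessions s
  JOIN telemetry.vehicles v USING (vehicle_id)
  WHERE s.session_date >= '2025-01-01'
    AND v.has_active_steering_assist = true
    AND EXISTS (
      SELECT 1 FROM telemetry.session_tags t
      WHERE t.measurement_session_id = s.measurement_session_id
        AND t.tag_key = 'active_steering_assist'
        AND t.tag_value = 'on')
""")
candidates.createOrReplaceTempView("candidate_sessions")

# Level 3: only now join the heavy samples table; the tiny candidate
# set lets file pruning skip almost everything on disk.
alc_slice = spark.sql("""
  SELECT smp.*
  FROM telemetry.samples smp
  JOIN candidate_sessions c USING (measurement_session_id)
  WHERE smp.signal_id = 1234   -- hypothetical speed-signal id
""")
```

Road-type and speed-range filters would be applied the same way, as extra tag predicates in the session query.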
This stepwise narrowing is a simple idea but feels like magic at scale: you avoid scanning terabytes just to discover a handful of candidate sessions. For teams running many ad hoc analyses, that translates into less cloud spend and more exploration without waiting for jobs to finish.
Why RLE matters and when it doesn’t
Run‑length encoding compresses consecutive identical samples into single rows with a tstart and tend, making operations like histograms or duration sums far cheaper for many signals. In Mercedes‑Benz tests, RLE cut runtimes dramatically for both histogram and ALC readiness queries: up to around 70% with Z‑Ordering, and substantial savings with Liquid Clustering too.
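A minimal sketch of how such an encoding can be derived with window functions, using the illustrative schema above (the article does not publish its conversion code, and its tstart/tend convention may differ slightly):

```python
# Collapse consecutive identical samples into (tstart, tend, value) runs
# using the classic gaps-and-islands pattern. Column names follow the
# illustrative schema above; the published conversion may differ.
from pyspark.sql import functions as F, Window

w = Window.partitionBy("measurement_session_id", "signal_id").orderBy("ts")

rle = (spark.table("telemetry.samples")
    .withColumn("changed",
        (F.col("value") != F.lag("value").over(w)).cast("int"))
    .withColumn("run_id",  # running count of value changes
        F.sum(F.coalesce(F.col("changed"), F.lit(1))).over(w))
    .groupBy("measurement_session_id", "signal_id", "run_id", "value")
    .agg(F.min("ts").alias("tstart"),
         F.max("ts").alias("tend")))  # last sample of the run

# A dwell-time aggregation now touches one row per run, not per sample.
hist = (rle.groupBy("signal_id", "value")
    .agg(F.sum(F.col("tend").cast("long") - F.col("tstart").cast("long"))
           .alias("total_seconds")))
```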
But RLE isn’t a universal silver bullet. Signals that change constantly (high entropy) or datasets where many signals are already event-driven see little benefit. That’s why benchmarking on your own telemetry is vital: the Databricks study used production MDF files, so you can see where RLE helps in practice.
Liquid Clustering versus Z‑Ordering: what the benchmarks showed
The benchmarking compared two main optimisations: hive‑style partitioning plus Z‑Ordering on signal_id, and Liquid Clustering on measurement_session_id and signal_id. Across use cases, Liquid Clustering most often delivered the best runtimes, especially when combined with RLE. It did, however, increase storage: RLE Liquid Clustering tables were about 77 TB vs 67 TB for RLE Z‑Ordering on the same raw 64.55 TB input.
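For reference, the two layouts map onto standard Delta commands roughly like this (table names illustrative; ZORDER BY and CLUSTER BY are the real Databricks syntax):

```python
# Layout A: hive-style partitioned table, Z-Ordered on signal_id.
spark.sql("OPTIMIZE telemetry.samples_zorder ZORDER BY (signal_id)")

# Layout B: Liquid Clustering on (measurement_session_id, signal_id).
spark.sql("""
  ALTER TABLE telemetry.samples_liquid
  CLUSTER BY (measurement_session_id, signal_id)
""")
spark.sql("OPTIMIZE telemetry.samples_liquid")  # (re)clusters new data
```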
In short, Liquid Clustering feels faster and more responsive for selective, session‑driven queries, while Z‑Ordering remains attractive for slightly lower storage overheads. The practical takeaway is to prefer Liquid Clustering when query latency matters and you can afford modest extra storage; choose Z‑Ordering when storage is the overriding constraint.
How to pick Delta table layout and cluster size for production
Start with the access patterns: are most queries session‑centric and highly selective, or broad scans and aggregations? If you need fast, ad hoc session filtering, aim for RLE plus Liquid Clustering and set a smallish Delta target file size (Databricks used 32 MB) so pruning works well. Use zstd compression for Parquet to balance storage and CPU cost.
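One way to apply those two knobs, assuming Databricks (delta.targetFileSize is a Delta table property there, and zstd for Parquet needs Spark 3.2+):

```python
# Write new Parquet/Delta files with zstd compression.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# Target ~32 MB files so dynamic file pruning can skip more data.
spark.sql("""
  ALTER TABLE telemetry.samples
  SET TBLPROPERTIES ('delta.targetFileSize' = '32mb')
""")
```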
Cluster sizing matters too. Benchmarks doubled node count across runs and saw nearly ideal scaling from X‑Small to Large, though very fast layouts (RLE + Liquid Clustering) show diminishing returns as baseline runtimes get tiny. The rule of thumb: scale until you hit reasonable runtime targets, but benchmark each change; even small optimisations yield big cost wins at petabyte scale.
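A trivial harness for that kind of benchmarking might look like the following; the noop sink forces full execution without writing results anywhere:

```python
import statistics
import time

def bench(query: str, runs: int = 3) -> float:
    """Median wall-clock seconds for a query, smoothing cache effects."""
    timings = []
    for _ in range(runs):
        t0 = time.perf_counter()
        # The 'noop' format executes the full plan but discards output.
        spark.sql(query).write.format("noop").mode("overwrite").save()
        timings.append(time.perf_counter() - t0)
    return statistics.median(timings)
```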
Memory and governance considerations for a fleet or multi‑team setup
The Mercedes‑Benz work emphasised memory‑optimised nodes to keep metadata hot in Delta cache, which helps pruning and low‑latency filtering. Equally important is governance: Unity Catalog or an equivalent is needed to let non‑SQL engineers discover datasets safely while protecting PII and proprietary signals. Databricks plans to publish conversion jobs, governance how‑tos, and low‑code frameworks so more engineers can use the model without deep SQL or Python skills.
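In Unity Catalog terms, that separation can be as simple as granting broad access to the metadata tables while restricting the raw samples; the group names below are hypothetical:

```python
# Everyone can explore sessions and tags; only the core team hits raw data.
spark.sql("GRANT SELECT ON TABLE telemetry.sessions TO `all-engineers`")
spark.sql("GRANT SELECT ON TABLE telemetry.session_tags TO `all-engineers`")
spark.sql("GRANT SELECT ON TABLE telemetry.samples TO `telemetry-core`")
```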
Governance isn’t glamorous, but its effects are tangible: good access controls, discoverability, and standardised tag schemas mean fewer accidental queries that blow budgets and more repeatable results for product teams.
What this means for engineers, data teams and product owners
For engineers, the model makes targeted investigations faster and less painful; for data teams, it’s a roadmap to lower query costs and better performance; for product owners, it shortens the loop from telemetry to feature decisions. The Mercedes‑Benz benchmarks show the model works on production data, not just lab tests, which is the critical difference when rolling solutions into development pipelines.
And yes, there’s a human side: faster iterations mean happier engineers and quicker fixes, which in turn can improve vehicle safety and user experience sooner.
Ready to make telemetry analysis less painful and more productive? Check current best practices, test RLE and Liquid Clustering on a representative slice of your own data, and measure both runtime and storage before you commit to a layout.
Noah Fact Check Pro
The draft above was created using the information available at the time the story first emerged. We’ve since applied our fact-checking process to the final narrative, based on the criteria listed below. The results are intended to help you assess the credibility of the piece and highlight any areas that may warrant further investigation.
Freshness check
Score: 10
Notes: The narrative is recent, published on the Databricks blog on November 11, 2025. No earlier versions or similar content were found, and the content appears original, with no signs of being recycled or republished. The article is based on a press release, which typically warrants a high freshness score, and no discrepancies in figures, dates, or quotes were identified.
Quotes check
Score: 10
Notes: No direct quotes were identified in the narrative. The content is presented in a descriptive and analytical manner without attributed statements.
Source reliability
Score: 10
Notes: The narrative originates from Databricks, a reputable organisation known for its expertise in data analytics and AI. This enhances the credibility and reliability of the information presented.
Plausibility check
Score: 10
Notes: The claims made in the narrative are plausible and align with current advancements in data analytics and automotive technology. The technical details provided are consistent with known industry practices and Databricks’ capabilities. The language and tone are appropriate for the subject matter and target audience.
Overall assessment
Verdict (FAIL, OPEN, PASS): PASS
Confidence (LOW, MEDIUM, HIGH): HIGH
Summary: The narrative is recent, original, and originates from a reputable organisation, enhancing its credibility. The content is plausible, with no discrepancies or signs of disinformation. The absence of direct quotes and the technical nature of the content suggest it is based on original research or internal findings. Overall, the narrative passes the fact-checking criteria with high confidence.
