Join the year’s premier education event
on open data architectures for data practitioners

Lineup

  • 2 tracks of expert-led content
  • Panels
  • Workshops
  • Technical sessions

Topics

  • AI-native data platforms
  • Data engineering for AI
  • Cost and performance optimization at scale
  • Maximizing openness and interoperability in your data stack

Audience

  • Data & AI practitioners
  • Data engineers
  • Data architects
  • Data platform engineers
  • Analytics engineers

Speakers

  • Başak Tuğçe Eskili, Machine Learning Engineer, Booking
  • Tosh Rayadhurgam, Engineering Leader - Ranking & Foundational AI, Meta
  • Ruiyang Wang, Technical Staff, Anthropic
  • Vamshi Pasunuru, Staff Software Engineer, Uber
  • Junping (JD) Du, Co-founder & CEO, Datastrato
  • Vinoth Chandar, CEO, Onehouse
  • Maxime Beauchemin, CEO, Preset
  • Simba Khadder, Context Engine, Redis
  • Fei Han, Director of Real-Time Data Platform, JD
  • Andrii Loievets, Director Software Engineering, Conductor
  • Revanth Chandupatla, Principal Engineer, Walmart
  • Holden Karau, Open Source Engineer, Snowflake
  • Satej Kumar Sahu, Principal Data Engineer, Zalando
  • Kevin Liu, Principal Software Engineer, Microsoft
  • Aditi Pandit, Principal Engineer, IBM
  • Julien Le Dem, Principal Engineer, Datadog
  • Mehul Batra, Software Engineer, DigitalOcean
  • Xinli Shang, Senior Staff Software Engineer, Uber
  • Kyle Weller, VP of Product, Onehouse
  • Yufei Gu, Staff Software Engineer, Snowflake
  • Rui Mo, Software Engineer, IBM
  • Dipankar Mazumdar, Director - Developers (Data/AI), Cloudera
  • Will Manning, Co-founder & CEO, Spiral
  • Suman Debnath, Technical Lead (ML), Anyscale
  • Chang She, CEO, LanceDB
  • Will Angel, AI Engineer, DroneDeploy
  • Yuxia Luo, Software Engineer, Alibaba Cloud
  • Rahil Chertara, Senior Software Engineer, Onehouse
  • Tim Meehan, Software Engineer, IBM

Select Keynotes

From Lakehouse to Agent Infrastructure: Data Platforms for the Age of Autonomous AI

The modern data platform has evolved through several architectural shifts. Warehouses optimized structured analytics. Data lakes unlocked scale and flexibility. The lakehouse unified transactional reliability with open data storage. But a new workload is now pushing data platforms into their next evolution: AI agents operating autonomously across enterprise systems.

Unlike dashboards or notebooks, AI agents retrieve context, run queries, reason over historical state, and trigger actions continuously—often at machine speed and with highly unpredictable access patterns. Most existing data architectures were never designed for this. As agents move into production, many enterprises are discovering a familiar failure mode: agents reaching directly into operational databases, SaaS systems, or transactional warehouses, creating cost blowups, reliability risks, and governance blind spots.

This talk explores the emergence of agent infrastructure as the next stage of the lakehouse evolution. Rather than monolithic platforms optimized for human-driven analytics, the future data platform must safely serve autonomous software systems operating everywhere. We’ll discuss the architectural primitives enabling this shift—unified storage for structured and unstructured data, versioned and incremental timelines for agent memory and auditability, and low-latency serving layers such as Onehouse LakeBase that allow AI agents to access enterprise data without impacting operational systems.

If the lakehouse defined modern analytics infrastructure, the next chapter is about building data platforms designed to power AI agents everywhere.

Vinoth Chandar, CEO, Onehouse
Track 1


Safe PDF Processing at Scale: A Rasterize-First Architecture

PDFs are one of the largest sources of unstructured data, but most pipelines treat them as trusted input. They're not. PDFs are arbitrary code, and parsing them is a real attack surface. This talk presents a rasterize-first architecture: sandbox the PDF, render to images, then OCR. I'll cover the threat model, the three-stage pipeline, accuracy benchmarks against traditional extraction, and lessons from scale. You'll leave with a practical approach to processing PDFs safely without sacrificing quality.
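
The sandbox step described above hinges on process isolation. Below is a minimal stdlib sketch of that pattern, a personal interpretation rather than the speaker's actual implementation: run each untrusted stage in a child process and accept only its stdout bytes, so a crashing or hanging parser cannot take down the host. The stand-in stage here just upper-cases its input; a real deployment would invoke a rasterizer and then an OCR step.

```python
import subprocess
import sys

def run_stage_sandboxed(cmd, stdin_bytes, timeout=30.0):
    """Run one untrusted pipeline stage (e.g. rasterization) in a child
    process. A crash, hang, or nonzero exit in the child cannot corrupt
    the parent; we only ever accept the child's stdout bytes."""
    proc = subprocess.run(cmd, input=stdin_bytes,
                          capture_output=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"stage {cmd[0]!r} rejected input "
                           f"(exit {proc.returncode})")
    return proc.stdout

# Stand-in stage: upper-cases its input, where a real pipeline would
# run "render PDF pages to images" and then "OCR the images".
stage = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"]
print(run_stage_sandboxed(stage, b"page text"))  # b'PAGE TEXT'
```

The parent treats the child purely as a bytes-in, bytes-out function, which is what makes the rasterize-first boundary defensible.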

Ruiyang Wang, Technical Staff, Anthropic
Scalable Table Services @Uber

Table services—including Compaction, Cleaning, and Clustering—are critical to balancing ingestion latency with query performance. In this talk, we share our learnings from operating these services at Uber scale. We will dive into the technical architecture that powers our background maintenance and discuss how we decouple these operations to ensure data freshness and high-performance analytics on a massive scale.
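
Compaction of the kind described can be pictured as a greedy packing pass over small files. The function below is a toy model to make the idea concrete; the names and the 512 MB target are illustrative assumptions, not Hudi's or Uber's actual table-service API.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into batches of roughly target_mb,
    so each batch can be rewritten as one larger file in the background."""
    plans, batch, total = [], [], 0
    for size in sorted(file_sizes_mb):
        batch.append(size)
        total += size
        if total >= target_mb:
            plans.append(batch)
            batch, total = [], 0
    if batch:  # leftover small files wait for the next pass
        plans.append(batch)
    return plans

print(plan_compaction([64, 64, 128, 128, 256, 512]))
# → [[64, 64, 128, 128, 256], [512]]
```

Decoupling this planning from ingestion is what lets writers stay fast while readers see fewer, larger files.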

Vamshi Pasunuru, Staff Software Engineer, Uber
Xinli Shang, Senior Staff Software Engineer, Uber
Building multi-tenant, multi-cloud streaming engines at Fortune One scale

Data and AI workloads at Walmart are evolving rapidly and require streaming data at millions of events per second. Every day, thousands of pipelines ingest data at trillion-record scale. This talk focuses on how we built multi-cloud, multi-tenant, self-service streaming engines (Kafka Connect, Spark, Flink), the lessons we learned, and how we optimized a petabyte's worth of compute reservation down to a few hundred terabytes to improve multi-tenancy and scalability.

Revanth Chandupatla, Principal Engineer, Walmart
How Conductor transformed their data layer with Apache Hudi, Onehouse, and StarRocks

In this talk, I will dive into Conductor's transformative journey from feature teams to platform teams. I will showcase the implementation of updated data pipelines and the creation of a data platform built on Apache Hudi and Onehouse. This platform enables us to fully harness the potential of our data, significantly enhancing user value. The new architecture facilitates the rapid integration of new data sources and pipelines, enabling the development of innovative products based on AI search results from ChatGPT, Perplexity, and other AI engines. I will illustrate how open table formats have empowered us to experiment with consumption-optimized tools and demonstrate the efficient use of StarRocks to generate analytical reports dynamically within seconds.

Attendees will gain insights into the process of integrating a Data Lake with Apache Hudi and Onehouse into existing architectures, identifying successful strategies and areas for improvement, and several optimization techniques will be presented.

This presentation holds significant relevance for OpenXData, as it demonstrates the tangible benefits that businesses can derive from technical solutions grounded in open architectures such as Apache Hudi with Onehouse.

Andrii Loievets, Director Software Engineering, Conductor
The Latest Architecture Evolution of Apache Hudi at JD.com

This session provides an overview of the current state of data lake at JD.com. We will delve into our team's latest innovations within the Apache Hudi core, detailing the technical characteristics. We will also share practical business implementations that leverage these newly developed capabilities, followed by insights into our future roadmap.

Fei Han, Director of Real-Time Data Platform, JD.com
Guardrails for Agentic AI: Governing Auto-Generated SQL and Spark Jobs Before Production

Agentic AI is rapidly moving from “assistive” to “autonomous”: natural-language requests are now translated into SQL and Spark programs that can execute end-to-end with little or no oversight. That shift changes the risk profile of modern data platforms. A single hallucinated join, an unbounded scan, a silent Cartesian product, or a poorly placed UDF can explode costs, violate access policies, or publish incorrect results—at machine speed.

This talk explores where governance should live when the generator is probabilistic and the runtime is powerful. We’ll compare three patterns: (1) “AI with AI” approaches using LLM judges and automated policy checks, (2) traditional human-in-the-loop review, and (3) a pragmatic middle path that combines deterministic controls with selective human escalation. We’ll outline a governance architecture that sits before execution and before productionization: a pre-flight layer that validates intent, enforces policy, and optimizes queries and code artifacts prior to deployment.

You’ll see how to implement layered guardrails—schema and lineage checks, access control verification, cost and performance estimation, semantic validation against known metrics, and regression tests on representative data—paired with explainability and approval workflows when confidence is low. We’ll also discuss how LLM judges can add value (e.g., intent consistency, anti-pattern detection, test generation) while keeping them bounded by verifiable rules and measurable acceptance criteria.

By the end, you’ll leave with a blueprint for a system that turns “prompt-to-prod” into a controlled pipeline: safer execution, predictable cost, better performance, and auditable accountability—without killing the productivity gains that made agentic AI attractive in the first place.
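
As a rough illustration of the deterministic-controls layer described above, a pre-flight linter over generated SQL might look like the sketch below. The rule set, messages, and thresholds are assumptions for the example, not Zalando's actual policy engine.

```python
import re

# Each rule: (pattern, finding message). Illustrative, not exhaustive.
RULES = [
    (re.compile(r"\bselect\s+\*", re.I),
     "SELECT * pulls an unbounded column set"),
    (re.compile(r"\bcross\s+join\b", re.I),
     "CROSS JOIN risks a Cartesian product"),
    (re.compile(r"\bdelete\b(?!.*\bwhere\b)", re.I | re.S),
     "DELETE without WHERE"),
]

def preflight(sql: str) -> list:
    """Return policy findings; an empty list means the query may proceed
    to the next layer (cost estimation, human review, etc.)."""
    findings = [msg for pattern, msg in RULES if pattern.search(sql)]
    if not re.search(r"\blimit\s+\d+\b", sql, re.I):
        findings.append("no LIMIT clause on generated query")
    return findings

print(preflight("SELECT * FROM a CROSS JOIN b"))
```

Because these checks are deterministic, they can gate an LLM judge rather than compete with it: only queries that pass the linter reach probabilistic review.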

Satej Kumar Sahu, Principal Data Engineer, Zalando
Vortex: Building GPU-Native Columnar Storage

This talk will cover Vortex, a Linux Foundation project building the fastest open-source file format on both CPUs and GPUs. We'll briefly cover Vortex's performance, then break down how we built the format to support efficient late materialization on both GPUs and CPUs. The key insight of the talk is that designing for GPUs also yields better CPU vectorization and better random access.

Will Manning, Co-founder & CEO, Spiral (SpiralDB)
Building a Personal Data Lakehouse

What is a data lakehouse, and why did I decide to build one for myself? This talk covers the process of building a personal data lakehouse with open source tooling, explores data management and the architectural considerations behind a data management system, and looks at how system requirements and characteristics differ across enterprise, corporate, and personal contexts.

Will Angel, AI Engineer, DroneDeploy
Booking.com's ultra-low latency feature platform

At Booking.com, we serve millions of real-time ML predictions every minute powering ranking, fraud detection, and personalization. To meet strict latency requirements, we built an ultra-low latency feature platform capable of serving features with p99.9 latency under 25ms at 200K requests per second.
In this talk, we share how we designed a high-throughput, self-service feature platform built on Amazon ElastiCache, enabling teams across Booking.com to serve ML features reliably at scale while maintaining strict latency SLOs.
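
For context on what a "p99.9 under 25ms" target means, a tail percentile can be computed from request samples with a few lines of Python. This is a nearest-rank sketch for illustration; the SLO numbers come from the abstract, not from this code.

```python
def percentile(samples, p):
    """Nearest-rank percentile of samples, with p in [0, 100]."""
    xs = sorted(samples)
    idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[idx]

# Pretend per-request latencies in ms; p99.9 is set by the slowest
# ~0.1% of requests, which is why tail latency dominates SLO work.
latencies_ms = list(range(1, 1001))
print(percentile(latencies_ms, 99.9))  # 999
```

At 200K requests per second, the 0.1% above the p99.9 line is still 200 requests every second, which is why the tail, not the median, drives the platform design.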

Başak Tuğçe Eskili, Machine Learning Engineer, Booking.com
What Happens to Your Data Architecture When the Query Layer Starts Making Decisions

Most data architectures were designed around a simple assumption: queries are deterministic. You send a query, you get a result. Agents break that assumption. The same input, the same data, a different output — sometimes subtly, sometimes catastrophically. In this talk, I'll walk through what that non-determinism actually means for open data architectures at scale: why your existing governance model doesn't account for it, why MCP changes the integration calculus but doesn't solve the trust problem, and what three specific architectural decisions teams should make before their agent layer is in production. No frameworks. No vision. Just the problems showing up in real systems right now and the patterns that are working.

Tosh Rayadhurgam, Engineering Leader - Ranking & Foundational AI, Meta
Anatomy of our Data Agent: How AI Supports Analytics at Preset

At Preset, we don't have a dedicated data team. Instead, we've built an AI agent, internally called "DatAgor," that handles a surprising amount of what a data team would do: answering ad-hoc analytics questions, building charts and dashboards, debugging pipeline issues, and helping anyone in the company self-serve on data. It's not magic, and it doesn't get everything right, but it's genuinely useful, and adoption across the team is growing fast.

This talk is an honest walkthrough of what we built, what works well, and where the gaps still are.

The foundation is an Agor Assistant, a persistent AI agent with access to our full data stack. Through Superset MCP, it can query data, build charts, and create dashboards: the same operations a human would perform in the UI. We've layered in access to our dbt models and Airflow pipelines for semantic context and data freshness, a custom CLI skill for Superset operations, and internal documentation. When tasks get complex, it can dispatch sub-agents to work in parallel. And through Slack integration, anyone at Preset (engineers, PMs, execs) can just ask a question and get a useful answer back.

The honest version: it handles the bulk of routine analytics work remarkably well. Ad-hoc questions, straightforward dashboards, pipeline investigations: these are its sweet spot. But it still needs human review. It doesn't always have full business context. The dashboards it builds are functional, not beautiful. Complex analysis still requires back-and-forth, and sometimes a human taking the wheel. The good news is that it learns. Through a memory and skill system, it gets better at the workflows your team actually uses.

We'll demo real workflows, share the architecture, and talk candidly about what it takes to run this in practice, not as a replacement for data expertise, but as a force multiplier that makes a small team punch well above its weight.

Key takeaways:
- Architecture of a production AI data agent: Superset MCP, pipeline context, sub-agent orchestration, and Slack access
- Where agentic analytics works great today and where it still needs a human in the loop
- How memory and skill systems help the agent improve over time on your team's specific workflows

Maxime Beauchemin, CEO, Preset
Track 2


What’s new in Spark 4.2 / 4.3 and how to optimize your UDFs in Spark 4+

Spark 4 user-defined functions (UDFs) have a bunch of new knobs for acceleration. This talk will look at the history of Spark user-defined functions, from resilient distributed dataset (RDD) land, where everything* was a lambda, to today, where we have roughly five types of UDFs (depending on how you count). We'll explore how to upgrade your UDFs for maximum performance (vectorization, code generation*, oh my!)

The final part of this talk will look at the future as well (transpilation, RPCs, etc.)

Holden Karau, Open Source Engineer, Snowflake
Driving Iceberg Adoption with Open Catalog and Open Datasets

An open, publicly available Iceberg REST Catalog (IRC) paired with well‑curated, open datasets will help drive the next tranche of Iceberg adoption. This will be the new paradigm to get started with Iceberg.

This talk will walk through the architecture, how it works end‑to‑end, and the benefits for both new and existing Iceberg users.

We’ll cover a few exciting use cases, such as:

- Simple UX (easiest way to get started with Iceberg on any engine with IRC)
- Fair Benchmark (same dataset, different engines)
- Cloud agnostic (supports all object storage)
- Data sharing (simplest way to share data across engines, clouds, vendors)

Kevin Liu, Principal Software Engineer, Microsoft
Column Storage for the AI era

The past few years have brought a Cambrian explosion of new columnar formats challenging Parquet, each promising optimizations for modern workloads.
Entering this AI-dominated era, some argue that data infrastructure designed before vector embeddings and GPU-optimized layouts won't cut it moving forward. While these projects incorporate genuine innovations, this framing overlooks what made the original successful: Parquet's main contribution isn't just technical.
The real opportunity is not in choosing between old and new, but in leveraging the communities of these established projects to absorb innovations while maintaining interoperability.

Julien Le Dem, Principal Engineer, Datadog
Lake, Stream, and Everything In Between: Apache Fluss and the Streaming Lakehouse

A brief introduction to Apache Fluss (streaming storage with tables as a first-class citizen), its features (union read, zero ETL, integration with the lakehouse and table formats), the architectural shift from Lambda to Kappa, and future work.

Mehul Batra, Software Engineer, DigitalOcean
Polaris Meets Hudi: Unifying Lakehouse Metadata Across Table Formats

Apache Polaris is designed as a next-generation open metadata and governance layer for modern lakehouses. While Polaris started with strong roots in Apache Iceberg, real-world data platforms rarely live in a single table format. Apache Hudi remains a critical choice for ingestion-heavy, streaming, and incremental processing workloads.
In this talk, we walk through how Polaris integrates with Apache Hudi to provide a unified catalog, consistent governance, and centralized metadata management across table formats. We will cover the motivation behind Hudi support, the architectural approach taken in Polaris, and the key challenges.
Attendees will see how Polaris enables format aware access control, discovery, and interoperability without forcing users to migrate or compromise on Hudi specific capabilities. We will also share lessons learned, current limitations, and what true multi format lakehouse governance looks like in practice.

Yufei Gu, Staff Software Engineer, Snowflake
Apache Gluten: Delivering Continuous Innovation in Big Data Analytics

Apache Gluten is an open-source project that brings the power of native execution to the modern data lakehouse. By offloading query execution from the JVM to high-performance C++ backends like Velox — and leveraging GPU acceleration — Gluten delivers major speedups for compute-intensive workloads while remaining fully compatible with popular engines such as Apache Spark and Flink.
In this session, we’ll introduce what Gluten is, how it works, and why it matters. You’ll learn about the architecture behind native and GPU-accelerated query execution, how Gluten integrates seamlessly into existing Spark and Delta Lake pipelines, and the performance gains users have seen in production.
Whether you're running analytics at scale, optimizing Delta Lake workloads, or just curious about the future of high-performance query engines, this talk will help you understand where Gluten fits in the open data ecosystem and how to get started.

Rui Mo, Software Engineer, IBM
What is Really "Open" in an Open Lakehouse Architecture?

There is a lot of excitement around the lakehouse architecture today, which unifies two mainstream data architectures - data warehouse & data lakes - promising to do more with less. All the major data vendors have embraced the use of open table formats, due to demand for the flexibility & openness promised by supporting an open format. Projects such as Apache Iceberg, Apache Hudi, and Delta Lake have been at the center of this shift. In addition, newer table formats like DuckLake and Lance have emerged. Together, these efforts have helped establish an open and adaptable foundation for data, enabling enterprises to choose compute engines based on their workload needs rather than being locked into proprietary storage formats.

However, as terms like “open table format” and “open data lakehouse” are often used interchangeably, there is a growing need for clarity and a deeper technical understanding of what openness actually means in this context. In this session, we will do a technical breakdown of the lakehouse architecture and examine what truly brings openness to a data platform.

Dipankar Mazumdar, Director - Developers (Data/AI), Cloudera
Metadata as the Control Plane: The Foundation of an AI-Native Data Platform

As AI systems evolve from models to autonomous agents, data platforms need more than storage and compute—they need a control plane. This session explains why metadata is becoming that control plane and the foundation of an AI-native data platform. Using Apache Gravitino as an example, we’ll show how a metadata-centric architecture unifies multi-cloud, multi-format, and multi-engine stacks, connecting formats like Iceberg, Hudi, and Lance with engines such as Spark, Trino, Ray, and Daft. You’ll learn how this approach enables AI agents and modern data platforms to safely discover, govern, and access data at scale.

Junping (JD) Du, Co-founder & CEO, Datastrato
The Physics of LLM Inference at Scale

While many developers can download a model from Hugging Face, few grasp why latency spikes the moment concurrent users hit an endpoint. This talk explores the "physics" of LLM inference, emphasizing that engineering a production-grade service requires understanding hardware constraints rather than just adding more GPUs. We will analyze the fundamental split between the prefill phase, which is compute-bound (GEMM) due to parallel prompt processing, and the decode phase, which is severely memory-bound (GEMV) as it generates tokens one by one. Central to this is the KV Cache Crisis; while caching transforms generation from O(n²) to O(n) per token, it introduces massive memory management challenges, often consuming more space than the model weights themselves.

To solve these bottlenecks, we will examine how continuous batching (iteration-level scheduling) and PagedAttention (block-based memory allocation) saturate GPUs by eliminating the "waiting waste" and fragmentation typical of static batching. We then bridge these single-GPU "physics" to distributed systems using Ray Serve, which provides the programmable infrastructure to manage heterogeneous replicas and offload CPU-bound work like tokenization to separate worker pools. The session concludes with a live demo featuring a production-ready deployment of Ray Serve and vLLM. Attendees will see how to move from single-node bottlenecks to a distributed, elastic system that maintains high throughput and low Inter-Token Latency (ITL) even under heavy concurrent load.
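
The O(n²)-to-O(n) claim above can be made concrete with a toy operation count. This is an illustrative model of attention work per decode step, not a profile of vLLM or any real engine.

```python
def ops_without_cache(n):
    """No KV cache: step t recomputes keys/values for all t tokens and
    attends over them, roughly t*t operations per step."""
    return sum(t * t for t in range(1, n + 1))

def ops_with_cache(n):
    """KV cache: step t attends one new query against t cached keys,
    roughly t operations per step."""
    return sum(t for t in range(1, n + 1))

print(ops_without_cache(4), ops_with_cache(4))  # 30 10
```

Summed over a sequence, the uncached path grows cubically while the cached path grows quadratically, which is exactly the trade the talk describes: compute saved, memory spent.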

Suman Debnath, Technical Lead (ML), Anyscale
Managing Data at Exabyte Scale for AI Model Training

More and more enterprises are post-training AI models to be custom-tailored for their domain and their users. Success in model training depends critically on establishing the right data flywheel to improve iteration speed. From exploration to curation, from feature engineering to training, current data infrastructure solutions are insufficient to support the full iteration loop. Instead, model development teams are wrestling with a hodgepodge of single-purpose tools and having to copy different parts of their data to different systems for different steps in their workflow. This is one of the most important reasons why so many enterprise AI model training efforts are delayed or fail. For this talk, we’ll cover how LanceDB solves this problem by managing all of your multimodal training data in a unified table and streamlining the entire exploration-to-GPU-loading path. Via customer case studies, we’ll outline what this new architecture looks like, practical recipes for success, and how it all fits into your existing enterprise data architecture.

Chang She, CEO, LanceDB

Workshop

3:35PM – 4:00PM PST
Open Data using Onehouse Cloud

If you've ever tried to build a data lakehouse, you know it's no small task. You've got to tie together file formats, table formats, storage platforms, catalogs, compute, and more. But what if there was an easy button?

Join this session to see how Onehouse delivers the Universal Data Lakehouse that is:

Fast - Ingest and incrementally process data from stream, operational databases, and cloud storage with minute-level data freshness.

Efficient - Innovative optimizations ensure that you squeeze every bit of performance out of your resources with a runtime optimized for lakehouse workloads.

Simple - Onehouse is delivered as a fully managed cloud service, so you can spin up a production-ready lakehouse in days or less.

The session will include a live demo. Attendees will be eligible for up to $1,000 in free credits to try Onehouse for their organization.

Chandra Krishnan, Solutions Engineer, Onehouse

Full agenda coming soon!

Register Now

Secure your spot at the premier data practitioner event! Don’t miss out on expert insights, hands-on workshops, and networking opportunities.

Apply to speak at our next edition

Are you a data practitioner with real-world lessons from designing, building, or operating the open data stack? We’d love to hear from you.
Themes we’re excited about:
  • AI-native data platforms
  • Data engineering for AI
  • Cost and performance optimization at scale
  • Maximizing openness and interoperability in your data stack