Join the year’s premier education event
on open data architectures for data practitioners

Lineup

  • 2 tracks of expert-led content
  • Panels
  • Workshops
  • Technical sessions

Topics

  • AI-native data platforms
  • Data engineering for AI
  • Cost and performance optimization at scale
  • Maximizing openness and interoperability in your data stack

Audience

  • Data & AI practitioners
  • Data engineers
  • Data architects
  • Data platform engineers
  • Analytics engineers

Speakers

  • Başak Tuğçe Eskili, Machine Learning Engineer, Booking
  • Tosh Rayadhurgam, Engineering Leader - Ranking & Foundational AI, Meta
  • Ruiyang Wang, Technical Staff, Anthropic
  • Vamshi Pasunuru, Staff Software Engineer, Uber
  • Junping (JD) Du, Co-founder & CEO, Datastrato
  • Vinoth Chandar, CEO, Onehouse
  • Maxime Beauchemin, CEO, Preset
  • Simba Khadder, Context Engine, Redis
  • Fei Han, Director of Real-Time Data Platform, JD
  • Andrii Loievets, Director Software Engineering, Conductor
  • Revanth Chandupatla, Principal Engineer, Walmart
  • Holden Karau, Open Source Engineer, Snowflake
  • Satej Kumar Sahu, Principal Data Engineer, Zalando
  • Kevin Liu, Principal Software Engineer, Microsoft
  • Aditi Pandit, Principal Engineer, IBM
  • Julien Le Dem, Principal Engineer, Datadog
  • Mehul Batra, Software Engineer, DigitalOcean
  • Xinli Shang, Senior Staff Software Engineer, Uber
  • Kyle Weller, VP of Product, Onehouse
  • Yufei Gu, Staff Software Engineer, Snowflake
  • Rui Mo, Software Engineer, IBM
  • Dipankar Mazumdar, Director - Developers (Data/AI), Cloudera
  • Will Manning, Co-founder & CEO, Spiral
  • Suman Debnath, Technical Lead (ML), Anyscale
  • Chang She, CEO, LanceDB
  • Will Angel, AI Engineer, DroneDeploy
  • Yuxia Luo, Software Engineer, Alibaba Cloud
  • Rahil Chertara, Senior Software Engineer, Onehouse
  • Tim Meehan, Software Engineer, IBM

Select Keynotes

From Lakehouse to Agent Infrastructure: Data Platforms for the Age of Autonomous AI

The modern data platform has evolved through several architectural shifts. Warehouses optimized structured analytics. Data lakes unlocked scale and flexibility. The lakehouse unified transactional reliability with open data storage. But a new workload is now pushing data platforms into their next evolution: AI agents operating autonomously across enterprise systems.

Unlike dashboards or notebooks, AI agents retrieve context, run queries, reason over historical state, and trigger actions continuously—often at machine speed and with highly unpredictable access patterns. Most existing data architectures were never designed for this. As agents move into production, many enterprises are discovering a familiar failure mode: agents reaching directly into operational databases, SaaS systems, or transactional warehouses, creating cost blowups, reliability risks, and governance blind spots.

This talk explores the emergence of agent infrastructure as the next stage of the lakehouse evolution. Rather than monolithic platforms optimized for human-driven analytics, the future data platform must safely serve autonomous software systems operating everywhere. We’ll discuss the architectural primitives enabling this shift—unified storage for structured and unstructured data, versioned and incremental timelines for agent memory and auditability, and low-latency serving layers such as Onehouse LakeBase that allow AI agents to access enterprise data without impacting operational systems.

If the lakehouse defined modern analytics infrastructure, the next chapter is about building data platforms designed to power AI agents everywhere.

Vinoth Chandar, CEO, Onehouse
Track 1


Safe PDF Processing at Scale: A Rasterize-First Architecture

PDFs are one of the largest sources of unstructured data, but most pipelines treat them as trusted input. They're not. PDFs are arbitrary code, and parsing them is a real attack surface. This talk presents a rasterize-first architecture: sandbox the PDF, render to images, then OCR. I'll cover the threat model, the three-stage pipeline, accuracy benchmarks against traditional extraction, and lessons from scale. You'll leave with a practical approach to processing PDFs safely without sacrificing quality.
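
The sandbox step described above hinges on process isolation. Below is a minimal stdlib sketch of that pattern, a personal interpretation rather than the speaker's actual implementation: run each untrusted stage in a child process and accept only its stdout bytes, so a crashing or hanging parser cannot take down the host. The stand-in stage here just upper-cases its input; a real deployment would invoke a rasterizer and then an OCR step.

```python
import subprocess
import sys

def run_stage_sandboxed(cmd, stdin_bytes, timeout=30.0):
    """Run one untrusted pipeline stage (e.g. rasterization) in a child
    process. A crash, hang, or nonzero exit in the child cannot corrupt
    the parent; we only ever accept the child's stdout bytes."""
    proc = subprocess.run(cmd, input=stdin_bytes,
                          capture_output=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"stage {cmd[0]!r} rejected input "
                           f"(exit {proc.returncode})")
    return proc.stdout

# Stand-in stage: upper-cases its input, where a real pipeline would
# run "render PDF pages to images" and then "OCR the images".
stage = [sys.executable, "-c",
         "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())"]
print(run_stage_sandboxed(stage, b"page text"))  # b'PAGE TEXT'
```

The parent treats the child purely as a bytes-in, bytes-out function, which is what makes the rasterize-first boundary defensible.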

Ruiyang Wang, Technical Staff, Anthropic
Scalable Table Services @Uber

Table services—including Compaction, Cleaning, and Clustering—are critical to balancing ingestion latency with query performance. In this talk, we share our learnings from operating these services at Uber scale. We will dive into the technical architecture that powers our background maintenance and discuss how we decouple these operations to ensure data freshness and high-performance analytics on a massive scale.
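
Compaction of the kind described can be pictured as a greedy packing pass over small files. The function below is a toy model to make the idea concrete; the names and the 512 MB target are illustrative assumptions, not Hudi's or Uber's actual table-service API.

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    """Greedily group small files into batches of roughly target_mb,
    so each batch can be rewritten as one larger file in the background."""
    plans, batch, total = [], [], 0
    for size in sorted(file_sizes_mb):
        batch.append(size)
        total += size
        if total >= target_mb:
            plans.append(batch)
            batch, total = [], 0
    if batch:  # leftover small files wait for the next pass
        plans.append(batch)
    return plans

print(plan_compaction([64, 64, 128, 128, 256, 512]))
# → [[64, 64, 128, 128, 256], [512]]
```

Decoupling this planning from ingestion is what lets writers stay fast while readers see fewer, larger files.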

Vamshi Pasunuru, Staff Software Engineer, Uber
Xinli Shang, Senior Staff Software Engineer, Uber
Building multi-tenant, multi-cloud streaming engines at Fortune One scale

Data and AI workloads at Walmart are evolving rapidly and require streaming data at millions of events per second. Every day, thousands of pipelines ingest data at trillion-record scale. This talk focuses on how we built multi-cloud, multi-tenant, self-service streaming engines (Kafka Connect, Spark, Flink), the lessons we learned, and how we optimized a petabyte's worth of compute reservation down to a few hundred terabytes to improve multi-tenancy and scalability.

Revanth Chandupatla, Principal Engineer, Walmart
How Conductor transformed their data layer with Apache Hudi, Onehouse, and StarRocks

In this talk, I will dive into Conductor's transformative journey from feature teams to platform teams. I will showcase the implementation of updated data pipelines and the creation of a data platform built on Apache Hudi and Onehouse. This platform enables us to fully harness the potential of our data, significantly enhancing user value. The new architecture facilitates the rapid integration of new data sources and pipelines, enabling the development of innovative products based on AI search results from ChatGPT, Perplexity, and other AI engines. I will illustrate how open table formats have empowered us to experiment with consumption-optimized tools and demonstrate the efficient use of StarRocks to generate analytical reports dynamically within seconds.

Attendees will gain insights into the process of integrating a Data Lake with Apache Hudi and Onehouse into existing architectures, identifying successful strategies and areas for improvement, and several optimization techniques will be presented.

This presentation holds significant relevance for OpenXData, as it demonstrates the tangible benefits that businesses can derive from technical solutions grounded in open architectures such as Apache Hudi with Onehouse.

Andrii Loievets, Director Software Engineering, Conductor
The Latest Architecture Evolution of Apache Hudi at JD.com

This session provides an overview of the current state of data lake at JD.com. We will delve into our team's latest innovations within the Apache Hudi core, detailing the technical characteristics. We will also share practical business implementations that leverage these newly developed capabilities, followed by insights into our future roadmap.

Fei Han, Director of Real-Time Data Platform, JD.com
Guardrails for Agentic AI: Governing Auto-Generated SQL and Spark Jobs Before Production

Agentic AI is rapidly moving from “assistive” to “autonomous”: natural-language requests are now translated into SQL and Spark programs that can execute end-to-end with little or no oversight. That shift changes the risk profile of modern data platforms. A single hallucinated join, an unbounded scan, a silent Cartesian product, or a poorly placed UDF can explode costs, violate access policies, or publish incorrect results—at machine speed.

This talk explores where governance should live when the generator is probabilistic and the runtime is powerful. We’ll compare three patterns: (1) “AI with AI” approaches using LLM judges and automated policy checks, (2) traditional human-in-the-loop review, and (3) a pragmatic middle path that combines deterministic controls with selective human escalation. We’ll outline a governance architecture that sits before execution and before productionization: a pre-flight layer that validates intent, enforces policy, and optimizes queries and code artifacts prior to deployment.

You’ll see how to implement layered guardrails—schema and lineage checks, access control verification, cost and performance estimation, semantic validation against known metrics, and regression tests on representative data—paired with explainability and approval workflows when confidence is low. We’ll also discuss how LLM judges can add value (e.g., intent consistency, anti-pattern detection, test generation) while keeping them bounded by verifiable rules and measurable acceptance criteria.

By the end, you’ll leave with a blueprint for a system that turns “prompt-to-prod” into a controlled pipeline: safer execution, predictable cost, better performance, and auditable accountability—without killing the productivity gains that made agentic AI attractive in the first place.
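
As a rough illustration of the deterministic-controls layer described above, a pre-flight linter over generated SQL might look like the sketch below. The rule set, messages, and thresholds are assumptions for the example, not Zalando's actual policy engine.

```python
import re

# Each rule: (pattern, finding message). Illustrative, not exhaustive.
RULES = [
    (re.compile(r"\bselect\s+\*", re.I),
     "SELECT * pulls an unbounded column set"),
    (re.compile(r"\bcross\s+join\b", re.I),
     "CROSS JOIN risks a Cartesian product"),
    (re.compile(r"\bdelete\b(?!.*\bwhere\b)", re.I | re.S),
     "DELETE without WHERE"),
]

def preflight(sql: str) -> list:
    """Return policy findings; an empty list means the query may proceed
    to the next layer (cost estimation, human review, etc.)."""
    findings = [msg for pattern, msg in RULES if pattern.search(sql)]
    if not re.search(r"\blimit\s+\d+\b", sql, re.I):
        findings.append("no LIMIT clause on generated query")
    return findings

print(preflight("SELECT * FROM a CROSS JOIN b"))
```

Because these checks are deterministic, they can gate an LLM judge rather than compete with it: only queries that pass the linter reach probabilistic review.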

Satej Kumar Sahu, Principal Data Engineer, Zalando
Vortex: Building GPU-Native Columnar Storage

This talk will cover Vortex, a Linux Foundation project building the fastest open-source file format on both CPUs and GPUs. We'll briefly cover Vortex's performance, then break down how we built the format to support efficient late materialization on both GPUs and CPUs. The key insight of the talk is that designing for GPUs also yields better CPU vectorization and better random access.

Will Manning, Co-founder & CEO, Spiral (SpiralDB)
Building a Personal Data Lakehouse

What is a data lakehouse, and why did I decide to build one for myself? This talk covers the process of building a personal data lakehouse with open source tooling, explores data management and the architectural considerations behind a data management system, and looks at how system requirements and characteristics differ across enterprise, corporate, and personal contexts.

Will Angel, AI Engineer, DroneDeploy
Booking.com's ultra-low latency feature platform

At Booking.com, we serve millions of real-time ML predictions every minute powering ranking, fraud detection, and personalization. To meet strict latency requirements, we built an ultra-low latency feature platform capable of serving features with p99.9 latency under 25ms at 200K requests per second.
In this talk, we share how we designed a high-throughput, self-service feature platform built on Amazon ElastiCache, enabling teams across Booking.com to serve ML features reliably at scale while maintaining strict latency SLOs.
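
For context on what a "p99.9 under 25ms" target means, a tail percentile can be computed from request samples with a few lines of Python. This is a nearest-rank sketch for illustration; the SLO numbers come from the abstract, not from this code.

```python
def percentile(samples, p):
    """Nearest-rank percentile of samples, with p in [0, 100]."""
    xs = sorted(samples)
    idx = min(len(xs) - 1, round(p / 100 * (len(xs) - 1)))
    return xs[idx]

# Pretend per-request latencies in ms; p99.9 is set by the slowest
# ~0.1% of requests, which is why tail latency dominates SLO work.
latencies_ms = list(range(1, 1001))
print(percentile(latencies_ms, 99.9))  # 999
```

At 200K requests per second, the 0.1% above the p99.9 line is still 200 requests every second, which is why the tail, not the median, drives the platform design.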

Başak Tuğçe Eskili, Machine Learning Engineer, Booking.com
What Happens to Your Data Architecture When the Query Layer Starts Making Decisions

Most data architectures were designed around a simple assumption: queries are deterministic. You send a query, you get a result. Agents break that assumption. The same input, the same data, a different output — sometimes subtly, sometimes catastrophically. In this talk, I'll walk through what that non-determinism actually means for open data architectures at scale: why your existing governance model doesn't account for it, why MCP changes the integration calculus but doesn't solve the trust problem, and what three specific architectural decisions teams should make before their agent layer is in production. No frameworks. No vision. Just the problems showing up in real systems right now and the patterns that are working.

Tosh Rayadhurgam, Engineering Leader - Ranking & Foundational AI, Meta
Anatomy of our Data Agent: How AI Supports Analytics at Preset

At Preset, we don't have a dedicated data team. Instead, we've built an AI agent, internally called "DatAgor," that handles a surprising amount of what a data team would do: answering ad-hoc analytics questions, building charts and dashboards, debugging pipeline issues, and helping anyone in the company self-serve on data. It's not magic, and it doesn't get everything right, but it's genuinely useful, and adoption across the team is growing fast.

This talk is an honest walkthrough of what we built, what works well, and where the gaps still are.

The foundation is an Agor Assistant, a persistent AI agent with access to our full data stack. Through Superset MCP, it can query data, build charts, and create dashboards: the same operations a human would perform in the UI. We've layered in access to our dbt models and Airflow pipelines for semantic context and data freshness, a custom CLI skill for Superset operations, and internal documentation. When tasks get complex, it can dispatch sub-agents to work in parallel. And through Slack integration, anyone at Preset (engineers, PMs, execs) can just ask a question and get a useful answer back.

The honest version: it handles the bulk of routine analytics work remarkably well. Ad-hoc questions, straightforward dashboards, pipeline investigations: these are its sweet spot. But it still needs human review. It doesn't always have full business context. The dashboards it builds are functional, not beautiful. Complex analysis still requires back-and-forth, and sometimes a human taking the wheel. The good news is that it learns. Through a memory and skill system, it gets better at the workflows your team actually uses.

We'll demo real workflows, share the architecture, and talk candidly about what it takes to run this in practice, not as a replacement for data expertise, but as a force multiplier that makes a small team punch well above its weight.

Key takeaways:
- Architecture of a production AI data agent: Superset MCP, pipeline context, sub-agent orchestration, and Slack access
- Where agentic analytics works great today and where it still needs a human in the loop
- How memory and skill systems help the agent improve over time on your team's specific workflows

Maxime Beauchemin, CEO, Preset
Track 2


What’s new in Spark 4.2 / 4.3 and how to optimize your UDFs in Spark 4+

Spark 4 user-defined functions (UDFs) have a bunch of new knobs for acceleration. This talk will look at the history of Spark user-defined functions, from resilient distributed dataset (RDD) land, where everything* was a lambda, to today, where we have roughly five types of UDFs (depending on how you count). We'll explore how to upgrade your UDFs for maximum performance (vectorization, code generation*, oh my!)

The final part of this talk will look at the future as well (transpilation, RPCs, etc.)

Holden Karau, Open Source Engineer, Snowflake
Driving Iceberg Adoption with Open Catalog and Open Datasets

An open, publicly available Iceberg REST Catalog (IRC) paired with well‑curated, open datasets will help drive the next tranche of Iceberg adoption. This will be the new paradigm to get started with Iceberg.

This talk will walk through the architecture, how it works end‑to‑end, and the benefits for both new and existing Iceberg users.

We’ll cover a few exciting use cases, such as:

- Simple UX (easiest way to get started with Iceberg on any engine with IRC)
- Fair Benchmark (same dataset, different engines)
- Cloud agnostic (supports all object storage)
- Data sharing (simplest way to share data across engines, clouds, vendors)

Kevin Liu, Principal Software Engineer, Microsoft
Column Storage for the AI era

The past few years have brought a Cambrian explosion of new columnar formats challenging Parquet, each promising optimizations for modern workloads.
Entering this AI-dominated era, some argue that data infrastructure designed before vector embeddings and GPU-optimized layouts won't cut it moving forward. While these projects incorporate genuine innovations, this framing overlooks what made the original successful: Parquet's main contribution isn't just technical.
The real opportunity is not in choosing between old and new, but in leveraging the communities of these established projects to absorb innovations while maintaining interoperability.

Julien Le Dem, Principal Engineer, Datadog
Lake, Stream, and Everything In Between: Apache Fluss and the Streaming Lakehouse

A brief introduction to Apache Fluss (streaming storage with tables as a first-class citizen), its features (union read, zero ETL, integration with the lakehouse and table formats), the architectural shift from Lambda to Kappa, and future work.

Mehul Batra, Software Engineer, DigitalOcean
Polaris Meets Hudi: Unifying Lakehouse Metadata Across Table Formats

Apache Polaris is designed as a next-generation open metadata and governance layer for modern lakehouses. While Polaris started with strong roots in Apache Iceberg, real-world data platforms rarely live in a single table format. Apache Hudi remains a critical choice for ingestion-heavy, streaming, and incremental processing workloads.
In this talk, we walk through how Polaris integrates with Apache Hudi to provide a unified catalog, consistent governance, and centralized metadata management across table formats. We will cover the motivation behind Hudi support, the architectural approach taken in Polaris, and the key challenges.
Attendees will see how Polaris enables format aware access control, discovery, and interoperability without forcing users to migrate or compromise on Hudi specific capabilities. We will also share lessons learned, current limitations, and what true multi format lakehouse governance looks like in practice.

Yufei Gu, Staff Software Engineer, Snowflake
Apache Gluten: Delivering Continuous Innovation in Big Data Analytics

Apache Gluten is an open-source project that brings the power of native execution to the modern data lakehouse. By offloading query execution from the JVM to high-performance C++ backends like Velox — and leveraging GPU acceleration — Gluten delivers major speedups for compute-intensive workloads while remaining fully compatible with popular engines such as Apache Spark and Flink.
In this session, we’ll introduce what Gluten is, how it works, and why it matters. You’ll learn about the architecture behind native and GPU-accelerated query execution, how Gluten integrates seamlessly into existing Spark and Delta Lake pipelines, and the performance gains users have seen in production.
Whether you're running analytics at scale, optimizing Delta Lake workloads, or just curious about the future of high-performance query engines, this talk will help you understand where Gluten fits in the open data ecosystem and how to get started.

Rui Mo, Software Engineer, IBM
What is Really "Open" in an Open Lakehouse Architecture?

There is a lot of excitement around the lakehouse architecture today, which unifies two mainstream data architectures - data warehouse & data lakes - promising to do more with less. All the major data vendors have embraced the use of open table formats, due to demand for the flexibility & openness promised by supporting an open format. Projects such as Apache Iceberg, Apache Hudi, and Delta Lake have been at the center of this shift. In addition, newer table formats like DuckLake and Lance have emerged. Together, these efforts have helped establish an open and adaptable foundation for data, enabling enterprises to choose compute engines based on their workload needs rather than being locked into proprietary storage formats.

However, as terms like “open table format” and “open data lakehouse” are often used interchangeably, there is a growing need for clarity and a deeper technical understanding of what openness actually means in this context. In this session, we will do a technical breakdown of the lakehouse architecture and examine what truly brings openness to a data platform.

Dipankar Mazumdar, Director - Developers (Data/AI), Cloudera
Metadata as the Control Plane: The Foundation of an AI-Native Data Platform

As AI systems evolve from models to autonomous agents, data platforms need more than storage and compute—they need a control plane. This session explains why metadata is becoming that control plane and the foundation of an AI-native data platform. Using Apache Gravitino as an example, we’ll show how a metadata-centric architecture unifies multi-cloud, multi-format, and multi-engine stacks, connecting formats like Iceberg, Hudi, and Lance with engines such as Spark, Trino, Ray, and Daft. You’ll learn how this approach enables AI agents and modern data platforms to safely discover, govern, and access data at scale.

Junping (JD) Du, Co-founder & CEO, Datastrato
The Physics of LLM Inference at Scale

While many developers can download a model from Hugging Face, few grasp why latency spikes the moment concurrent users hit an endpoint. This talk explores the "physics" of LLM inference, emphasizing that engineering a production-grade service requires understanding hardware constraints rather than just adding more GPUs. We will analyze the fundamental split between the prefill phase, which is compute-bound (GEMM) due to parallel prompt processing, and the decode phase, which is severely memory-bound (GEMV) as it generates tokens one by one. Central to this is the KV Cache Crisis; while caching transforms generation from O(n²) to O(n) per token, it introduces massive memory management challenges, often consuming more space than the model weights themselves.

To solve these bottlenecks, we will examine how continuous batching (iteration-level scheduling) and PagedAttention (block-based memory allocation) saturate GPUs by eliminating the "waiting waste" and fragmentation typical of static batching. We then bridge these single-GPU "physics" to distributed systems using Ray Serve, which provides the programmable infrastructure to manage heterogeneous replicas and offload CPU-bound work like tokenization to separate worker pools. The session concludes with a live demo featuring a production-ready deployment of Ray Serve and vLLM. Attendees will see how to move from single-node bottlenecks to a distributed, elastic system that maintains high throughput and low Inter-Token Latency (ITL) even under heavy concurrent load.
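
The O(n²)-to-O(n) claim above can be made concrete with a toy operation count. This is an illustrative model of attention work per decode step, not a profile of vLLM or any real engine.

```python
def ops_without_cache(n):
    """No KV cache: step t recomputes keys/values for all t tokens and
    attends over them, roughly t*t operations per step."""
    return sum(t * t for t in range(1, n + 1))

def ops_with_cache(n):
    """KV cache: step t attends one new query against t cached keys,
    roughly t operations per step."""
    return sum(t for t in range(1, n + 1))

print(ops_without_cache(4), ops_with_cache(4))  # 30 10
```

Summed over a sequence, the uncached path grows cubically while the cached path grows quadratically, which is exactly the trade the talk describes: compute saved, memory spent.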

Suman Debnath, Technical Lead (ML), Anyscale
Managing Data at Exabyte Scale for AI Model Training

More and more enterprises are post-training AI models to be custom-tailored for their domain and their users. Success in model training depends critically on establishing the right data flywheel to improve iteration speed. From exploration to curation, from feature engineering to training, current data infrastructure solutions are insufficient to support the full iteration loop. Instead, model development teams are wrestling with a hodgepodge of single-purpose tools and having to copy different parts of their data to different systems for different steps in their workflow. This is one of the most important reasons why so many enterprise AI model training efforts are delayed or fail. For this talk, we’ll cover how LanceDB solves this problem by managing all of your multimodal training data in a unified table and streamlining the entire exploration-to-GPU-loading path. Via customer case studies, we’ll outline what this new architecture looks like, practical recipes for success, and how it all fits into your existing enterprise data architecture.

Chang She, CEO, LanceDB

Workshop

3:35PM – 4:00PM PST
Open Data using Onehouse Cloud

If you've ever tried to build a data lakehouse, you know it's no small task. You've got to tie together file formats, table formats, storage platforms, catalogs, compute, and more. But what if there was an easy button?

Join this session to see how Onehouse delivers the Universal Data Lakehouse that is:

Fast - Ingest and incrementally process data from stream, operational databases, and cloud storage with minute-level data freshness.

Efficient - Innovative optimizations ensure that you squeeze every bit of performance out of your resources with a runtime optimized for lakehouse workloads.

Simple - Onehouse is delivered as a fully managed cloud service, so you can spin up a production-ready lakehouse in days or less.

The session will include a live demo. Attendees will be eligible for up to $1,000 in free credits to try Onehouse for their organization.

Chandra Krishnan, Solutions Engineer, Onehouse

Full agenda coming soon!

Register Now

Secure your spot at the premier data practitioner event! Don’t miss out on expert insights, hands-on workshops, and networking opportunities.

Apply to speak at our next edition

Are you a data practitioner with real-world lessons from designing, building, or operating the open data stack? We’d love to hear from you.
Themes we’re excited about:
  • AI-native data platforms
  • Data engineering for AI
  • Cost and performance optimization at scale
  • Maximizing openness and interoperability in your data stack