September 19-21, 2023
Bilbao, Spain
Note: The schedule is subject to change.

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for Open Source Summit Europe 2023 to participate in the sessions. If you have not registered but would like to join us, please go to the event registration page to purchase a registration.

This schedule is displayed in Central European Summer Time (UTC/GMT +2).

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Thursday, September 21 • 15:55 - 16:35
Overcoming I/O Bottlenecks in LLM Training with Open-Source Distributed Caching - Lu Qiu & Jasmine Wang, Alluxio


Large Language Model (LLM) training is a resource-intensive process that demands significant storage, CPU, and GPU resources, along with frequent I/O on large numbers of small files. As LLMs grow more complex, the demand for high-performance, scalable data processing solutions increases, especially in distributed cloud training. Traditional data platform architectures struggle to sustain the required I/O throughput, resulting in underutilized GPUs and inefficient resource usage. In this technical presentation, Lu Qiu & Jasmine Wang will dive deep into an open-source architecture that optimizes I/O throughout the model training pipeline, ensuring adequate throughput for GPU workloads. They will explore the integration of a distributed caching system with TensorFlow/PyTorch workloads running in the cloud and discuss techniques for overcoming I/O challenges in model training. This architecture improves performance, reduces metadata latency, and enables high GPU utilization. Participants will gain insights into implementing this architecture for enhanced resource efficiency and adaptability to real-time workloads. They will also learn from practical, real-world examples, including Zhihu (China's Quora), that illustrate the benefits of this open-source approach.

Speakers
Jasmine Wang

Head of Community & Developer Relations, Alluxio
Jasmine Wang is the Head of Community and DevRel at Alluxio. She is a former national debate champion turned traveling yoga teacher, with a strong passion for building teams and being the bridge at early-stage startups in Silicon Valley. Previously, she worked as the Head of Global...
Lu Qiu

Machine Learning Engineer, Alluxio
Lu Qiu is a machine learning engineer at Alluxio and a PMC maintainer of the open-source Alluxio project. Lu develops big data solutions for AI/ML training. Before that, Lu was responsible for core Alluxio components including leader election, journal management, and metrics management...



Thursday September 21, 2023 15:55 - 16:35 CEST
Room 0D-2-0D-3 (Floor 0)
  Open AI & Data Forum
  • Presentation Slides Attached: Yes