
Vision Language Models
Building VLMs with Hugging Face
$167.49 - Paperback
300 pages
Release Date: 1 September 2026
Summary
Vision language models (VLMs) combine computer vision and natural language processing to create powerful systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using the most up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), OpenAI (CLIP), and others, written by leading researchers and practitioners Merve Noyan, Miquel Farré, Andrés Marafioti, and Orr Zohar.
Book Details
| ISBN-13: | 9798341624047 |
|---|---|
| Author: | Merve Noyan, Andrés Marafioti, Miquel Farré, Orr Zohar |
| Publisher: | O'Reilly Media |
| Imprint: | O'Reilly Media |
| Format: | Paperback |
| Number of Pages: | 300 |
| Release Date: | 1 September 2026 |
About the Authors
Merve Noyan
Merve Noyan is a machine learning engineer on the ML advocacy engineering team at Hugging Face. She builds tools that enable people to work with vision language models across the Hugging Face ecosystem (transformers, TRL, smolagents). Previously, she worked at several companies building natural language understanding solutions for information retrieval and conversational agents.
Andrés Marafioti
Andrés Marafioti holds a PhD in applied machine learning, with a focus on multimodal generative methods. Previously a senior ML engineer at Unity, he played a key role in bringing multimodal ML-based products from concept to market adoption. Now at Hugging Face, Andrés conducts cutting-edge research in multimodal and memory-efficient models, leading the development of SmolVLM, a state-of-the-art vision language model. He has co-authored several impactful papers in the VLM space, such as “Building and Better Understanding Vision-Language Models.”
Miquel Farré
Miquel Farré is a video technology expert with over 15 years of experience and more than 60 patents in machine learning and information science. His career began at the Fraunhofer Institute, where he designed advanced video codecs, and continued at Nagravision, where he developed video streaming security modules. Transitioning to video understanding, Miquel joined Disney to architect the enterprise content metadata platform, leading machine learning initiatives across Pixar, Marvel, Lucasfilm, ABC, and ESPN. He then moved to YouTube, driving search monetization before expanding his focus to lead monetization for the platform’s Home and Watch Next surfaces. Before joining Hugging Face to work on video multimodal large language models, Miquel founded Arbro AI to build automated farming solutions.
Orr Zohar
Orr Zohar is a PhD candidate in the Stanford Vision and Learning (SVL) Lab at Stanford University. His research centers on large multimodal models, particularly in video understanding, with a focus on self-training methodologies and agentic design. Orr has co-developed innovative approaches such as Video-STaR, a self-training method for video instruction tuning, and VideoAgent, an agent-based framework for long-form video comprehension. Notably, he led the Apollo project, a comprehensive study exploring video understanding in large multimodal models, resulting in the creation of the Apollo family of models that set new benchmarks in the field.




