
Vision Language Models
Building VLMs with Hugging Face
$167.49 - Paperback
300 pages
Release Date: 1 September 2026
Summary
Vision language models (VLMs) combine computer vision and natural language processing to create powerful systems that can interpret, generate, and respond in multimodal contexts. Vision Language Models is a hands-on guide to building real-world VLMs using the most up-to-date stack of machine learning tools from Hugging Face, Meta (PyTorch), NVIDIA (CUDA), OpenAI (CLIP), and others, written by leading researchers and practitioners Merve Noyan, Miquel Farré, Andrés Marafioti, and Orr Zohar.
Book Details
| ISBN-13: | 9798341624047 |
|---|---|
| Author: | Merve Noyan, Andrés Marafioti, Miquel Farré, Orr Zohar |
| Publisher: | O'Reilly Media |
| Imprint: | O'Reilly Media |
| Format: | Paperback |
| Number of Pages: | 300 |
| Release Date: | 1 September 2026 |
About the Authors
Merve Noyan
Merve Noyan is a machine learning engineer on the ML advocacy engineering team at Hugging Face. She builds tools that enable people to work with vision language models across the Hugging Face ecosystem (transformers, TRL, smolagents). Previously, she worked at several companies building natural language understanding solutions for information retrieval and conversational agents.
Andrés Marafioti
Andrés Marafioti holds a PhD in applied machine learning, with a focus on multimodal generative methods. Previously a senior ML engineer at Unity, he played a key role in bringing multimodal ML-based products from concept to market adoption. Now at Hugging Face, Andrés conducts cutting-edge research in multimodal and memory-efficient models, leading the development of SmolVLM, a state-of-the-art vision language model. He has co-authored several impactful papers in the VLM space, such as “Building and Better Understanding Vision-Language Models.”
Miquel Farré
Miquel Farré is a video technology expert with over 15 years of experience and more than 60 patents in machine learning and information science. His career began at the Fraunhofer Institute, where he designed advanced video codecs, and continued at Nagravision, where he developed video streaming security modules. Transitioning to video understanding, Miquel joined Disney to architect the enterprise content metadata platform, leading machine learning initiatives across Pixar, Marvel, Lucasfilm, ABC, and ESPN. He then moved to YouTube, driving search monetization before expanding his focus to lead monetization for the platform’s Home and Watch Next surfaces. Before joining Hugging Face to work on video multimodal large language models, Miquel founded Arbro AI to build automated farming solutions.
Orr Zohar
Orr Zohar is a PhD candidate in the Stanford Vision and Learning (SVL) Lab at Stanford University. His research centers on large multimodal models, particularly in video understanding, with a focus on self-training methodologies and agentic design. Orr has co-developed innovative approaches such as Video-STaR, a self-training method for video instruction tuning, and VideoAgent, an agent-based framework for long-form video comprehension. Notably, he led the Apollo project, a comprehensive study exploring video understanding in large multimodal models, resulting in the creation of the Apollo family of models that set new benchmarks in the field.




