Emanuele Vivoli
Emanuele Vivoli is a PhD student jointly at the Computer Vision Center (UAB, Barcelona) and MICC (UNIFI, Italy), where he works on vision and language, particularly on Comics/Manga, supervised by Marco Bertini and Dimosthenis Karatzas.
In reverse chronological order:
He started his PhD in November 2022 in Florence, and in October 2023 in Barcelona.
Previously, in 2022, he interned at the Computer Vision Center (UAB, Barcelona), working on Multilingual Scene-Text VQA.
From 2021 to 2022 he worked as a researcher at the AILab (UNIFI, Italy), supervised by Simone Marinai, working on Document and Table Recognition.
Finally, in 2019 he did a research stay at EISLAB (Luleå University of Technology), supervised by Marcus Liwicki, working on EEG and RNNs.
He has published in conferences such as NeurIPS, ECCV, BMVC, ICDAR, ICPR, IRCDL, and ACM DocEng. He has served as a reviewer for ECCV, ICCV, CVPR, ACM Multimedia, ICDAR, and IJDAR.
He has also worked on landmine detection for saving lives and humanitarian demining. Check his Google Scholar for more info.
Email / CV / Bio / Scholar / Twitter / Github
Research
I'm interested in Vision and Language, more specifically in Comics/Manga. My work is mainly about Comics Understanding, bridging the gap between the comic medium and Vision and Language. Some papers are highlighted.
ComiCap: A VLMs pipeline for dense captioning of Comic Panels
Emanuele Vivoli,
Niccolò Biondi,
Marco Bertini,
Dimosthenis Karatzas
ECCV (workshop) AI4VA, 2024
This work proposes a Vision-Language Model pipeline that generates dense, grounded captions for comic panels, outperforming task-specific models without additional training. The pipeline was used to annotate over 2 million panels, enhancing comic understanding.
One missing piece in Vision and Language: A Survey on Comics Understanding
Emanuele Vivoli,
Andrey Barsky,
Mohamed Ali Souibgui,
Artemi Llabrés,
Marco Bertini,
Dimosthenis Karatzas
under review, 2024
This survey reviews comics understanding through the lens of vision-language models, introducing a new taxonomy, the Layer of Comics Understanding (LoCU), to redefine tasks in comics research. It also analyzes existing datasets and highlights challenges and future directions for AI applications in comics' unique visual-textual narratives.
Towards Generative Class Prompt Learning for Fine-grained Visual Recognition
Soumitri Chattopadhyay,
Sanket Biswas,
Emanuele Vivoli,
Josep Lladós
BMVC, 2024   (oral)
This paper proposes GCPL and CoMPLe to improve fine-grained categorization in vision-language models using generative modeling and contrastive learning, outperforming existing few-shot recognition methods.
CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding
Emanuele Vivoli,
Marco Bertini,
Dimosthenis Karatzas
NeurIPS (D&B), 2024
CoMix is a multi-task comic analysis benchmark covering object detection, character identification, and multi-modal reasoning, using diverse datasets to balance styles beyond manga. It evaluates models in zero-shot and fine-tuning settings, revealing a gap between human and model performance, and sets a new standard for comprehensive comic understanding.
Comics Datasets Framework: Mix of Comics datasets for detection benchmarking
Emanuele Vivoli,
Irene Campaioli,
Mariateresa Nardoni,
Niccolò Biondi,
Marco Bertini,
Dimosthenis Karatzas
ICDAR (workshop) MANPU, 2024   (oral)
This work introduces the Comics Datasets Framework, standardizing annotations and addressing manga overrepresentation with the Comics100 dataset. It benchmarks detection architectures to tackle challenges like small datasets and inconsistent annotations, aiming to improve object detection and support more complex computational tasks in comics.
Multimodal Transformer for Comics Text-Cloze
Emanuele Vivoli*,
Joan Lafuente Baeza*,
Ernest Valveny Llobet,
Dimosthenis Karatzas
ICDAR, 2024   (oral)
This work introduces a Multimodal-LLM for a comics text-cloze task, improving accuracy by 10% over existing models through a domain-adapted visual encoder and new OCR annotations, and extends the task to a generative format.
Deep-learning for dysgraphia detection in children handwritings
Andrea Gemelli*,
Emanuele Vivoli*,
Simone Marinai,
Tamara Zappaterra
ACM DocEng, 2023   (oral)
This paper proposes a smart pen and deep learning-based approach for early dysgraphia detection in children, offering a faster and more objective alternative to the traditional BHK test, validated through handwriting samples and expert interviews.
CTE: A Dataset for Contextualized Table Extraction
Andrea Gemelli*,
Emanuele Vivoli*,
Simone Marinai
IRCDL, 2023
This paper introduces Contextualized Table Extraction (CTE), a task to extract and structure tables within their document context, supported by a new dataset of 75k annotated pages from scientific papers, combining data from PubTables-1M and PubLayNet.
MUST-VQA: MUltilingual Scene-Text VQA
Emanuele Vivoli,
Ali Furkan Biten,
Andres Mafla,
Dimosthenis Karatzas,
Lluis Gomez
ECCV (workshop) Text in Everything, 2022
This paper introduces a framework for Multilingual Scene Text Visual Question Answering (MUST-VQA) that handles new languages in a zero-shot setting, evaluates models in IID and zero-shot scenarios, and demonstrates the effectiveness of adapting multilingual language models for the STVQA task.
Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents
Andrea Gemelli*,
Emanuele Vivoli*,
Simone Marinai
ICPR, 2022
This work addresses the problem of table extraction in scientific papers using Graph Neural Networks with enriched representation embeddings to improve distinction between tables, cells, and headers, evaluated on a dataset combining PubLayNet and PubTables-1M.