In reverse chronological order:
I started my PhD in November 2022 in Florence, and in October 2023 in Barcelona.
Previously, I interned in 2022 at the Computer Vision Center (UAB, Barcelona), working on Multilingual Scene-Text VQA.
From 2021 to 2022, I worked as a researcher at the AILab (UNIFI, Italy), supervised by Simone Marinai, working on Document and Table Recognition.
Finally, I did a research stay at EISLAB (Luleå University of Technology) in 2019, supervised by Marcus Liwicki, working on EEG and RNNs.
I have published at conferences such as NeurIPS, ECCV, BMVC, ICDAR, ICPR, IRCDL, and ACM DocEng. I have served as a reviewer for NeurIPS, CVPR, ECCV, ICCV, BMVC, ACM Multimedia, ICDAR, and IJDAR.
I have also worked on landmine detection to support humanitarian demining and save lives. Check my Google Scholar for more info.
I'm interested in Vision and Language, more specifically in Comics/Manga. My work mainly focuses on Comics Understanding, bridging the gap between the comic medium and Vision and Language. Some papers are highlighted.
In this work, we introduce CoSMo, a multimodal transformer for page stream segmentation in comic books. By leveraging visual and textual features to accurately segment comic book pages into reading streams, it establishes a new state of the art, outperforming significantly larger vision-language models on a newly curated dataset of over 20k annotated pages.
In this work, I introduce ComicsPAP, a large-scale benchmark for comic strip understanding with over 100k samples organized into five subtasks. Through this Pick-a-Panel framework, I evaluate state-of-the-art MLLMs, showing their limitations in capturing sequential and contextual dependencies, and propose adaptations that outperform models 10x larger.
This work proposes a Vision-Language Model pipeline to generate dense, grounded captions for comic panels, outperforming task-specific models without additional training. I used it to annotate over 2 million panels, enhancing comic understanding.
In this survey, I review comics understanding through the lens of vision-language models. I introduce a new taxonomy, the Layer of Comics Understanding (LoCU), to redefine tasks in comics research, analyze existing datasets, and highlight challenges and future directions for AI applied to comics' unique visual-textual narratives.
In this paper, I propose GCPL and CoMPLe to improve fine-grained categorization in vision-language models using generative modeling and contrastive learning, outperforming existing few-shot recognition methods.
CoMix is a multi-task comic analysis benchmark covering object detection, character identification, and multi-modal reasoning, using diverse datasets to balance styles beyond manga. I evaluate models in zero-shot and fine-tuning settings, revealing a gap between human and model performance, and set a new standard for comprehensive comic understanding.
In this work, I introduce the Comics Datasets Framework, standardizing annotations and addressing manga overrepresentation with the Comics100 dataset. I benchmark detection architectures to tackle challenges like small datasets and inconsistent annotations, aiming to improve object detection and support more complex computational tasks in comics.
In this work, I introduce a Multimodal-LLM for a comics text-cloze task, improving accuracy by 10% over existing models through a domain-adapted visual encoder and new OCR annotations, and extend the task to a generative format.
In this paper, I propose a smart pen and deep learning-based approach for early dysgraphia detection in children, offering a faster and more objective alternative to the traditional BHK test, validated through handwriting samples and expert interviews.
In this paper, I introduce Contextualized Table Extraction (CTE), a task to extract and structure tables within their document context, supported by a new dataset of 75k annotated pages from scientific papers, combining data from PubTables-1M and PubLayNet.
In this paper, I introduce a framework for Multilingual Scene Text Visual Question Answering (MUST-VQA) that handles new languages in a zero-shot setting, evaluates models in IID and zero-shot scenarios, and demonstrates the effectiveness of adapting multilingual language models for the STVQA task.
In this work, I address the problem of table extraction in scientific papers using Graph Neural Networks with enriched representation embeddings to improve distinction between tables, cells, and headers, evaluated on a dataset combining PubLayNet and PubTables-1M.
Thanks to Jon Barron for the website's source code.