I am interested in developing algorithms and wearable systems based on egocentric vision to support users in their daily tasks, whether in home/personal or work-related scenarios. I have been working on egocentric (or first-person) vision since the beginning of my PhD (2013) and have developed experience in data collection and labeling, the definition of tasks, the development of algorithms, and their evaluation.
Research Highlights
This page highlights recent research aligned with my main research interests. Please see the publications page for a full list of publications.
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos
@inproceedings{seminara2024differentiable, author = {Seminara, Luigi and Farinella, Giovanni Maria and Furnari, Antonino}, booktitle = {Advances in Neural Information Processing Systems}, title = {Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos}, pdf = {https://arxiv.org/pdf/2406.01486.pdf}, url = {https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning}, year = {2024} }
Procedural activities are sequences of key-steps aimed at achieving specific goals. They are crucial to build intelligent agents able to assist users effectively. In this context, task graphs have emerged as a human-understandable representation of procedural activities, encoding a partial ordering over the key-steps. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, in this paper, we propose an approach based on direct maximum likelihood optimization of edges' weights, which allows gradient-based learning of task graphs and can be naturally plugged into neural network architectures. Experiments on the CaptainCook4D dataset demonstrate the ability of our approach to predict accurate task graphs from the observation of action sequences, with an improvement of +16.7% over previous approaches. Owing to the differentiability of the proposed framework, we also introduce a feature-based approach, aiming to predict task graphs from key-step textual or video embeddings, for which we observe emerging video understanding abilities. Task graphs learned with our approach are also shown to significantly enhance online mistake detection in procedural egocentric videos, achieving notable gains of +19.8% and +7.5% on the Assembly101 and EPIC-Tent datasets. Code for replicating experiments will be publicly released.
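To make the core idea concrete, here is a minimal, hypothetical sketch (not the released code) of gradient-based task graph learning: edge weights form a learnable matrix, and the likelihood of observed key-step sequences is maximized so that edges consistent with the observed orderings receive higher weights. Class and function names are illustrative.

```python
import torch
import torch.nn as nn

class TaskGraph(nn.Module):
    """Toy differentiable task graph: a learnable matrix of edge logits."""
    def __init__(self, num_keysteps: int):
        super().__init__()
        # edge_logits[i, j] scores a precedence edge i -> j
        self.edge_logits = nn.Parameter(torch.zeros(num_keysteps, num_keysteps))

    def sequence_log_likelihood(self, sequence):
        """Log-likelihood of observing the key-steps in the given order."""
        log_lik = torch.tensor(0.0)
        done = torch.zeros(self.edge_logits.shape[0])  # indicator of completed steps
        for step in sequence:
            # score each candidate step by the edges coming from completed steps
            scores = self.edge_logits.t() @ done
            log_lik = log_lik + torch.log_softmax(scores, dim=0)[step]
            done = done.clone()
            done[step] = 1.0
        return log_lik

graph = TaskGraph(num_keysteps=4)
optimizer = torch.optim.Adam(graph.parameters(), lr=0.1)
sequences = [[0, 1, 2, 3], [0, 2, 1, 3]]  # toy observed key-step sequences

for _ in range(200):
    loss = -sum(graph.sequence_log_likelihood(s) for s in sequences)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After training, thresholding the learned edge weights yields a partial ordering over key-steps, which is the kind of human-readable structure a task graph is meant to expose.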
AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation
@inproceedings{mur-labadia2024AFF-ttention, pdf = { https://arxiv.org/pdf/2406.01194.pdf }, year = { 2024 }, booktitle = { European Conference on Computer Vision (ECCV) }, title = { AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation }, author = { Lorenzo Mur-Labadia and Ruben Martinez-Cantin and Josechu Guerrero and Giovanni Maria Farinella and Antonino Furnari }, url = {https://github.com/lmur98/AFFttention} }
Short-Term object-interaction Anticipation (STA) consists of detecting the location of the next-active objects, the noun and verb categories of the interaction, and the time to contact from the observation of egocentric video. This ability is fundamental for wearable assistants or human-robot interaction to understand the user's goals, but there is still room for improvement to perform STA in a precise and reliable way. In this work, we improve the performance of STA predictions with two contributions: 1. We propose STAformer, a novel attention-based architecture integrating frame-guided temporal pooling, dual image-video attention, and multiscale feature fusion to support STA predictions from an image-video input pair. 2. We introduce two novel modules to ground STA predictions on human behavior by modeling affordances. First, we integrate an environment affordance model which acts as a persistent memory of interactions that can take place in a given physical scene. Second, we predict interaction hotspots from the observation of hands and object trajectories, increasing the confidence of STA predictions localized around the hotspot. Our results show significant relative Overall Top-5 mAP improvements of up to +45% on Ego4D and +42% on a novel set of curated EPIC-Kitchens STA labels. We will release the code, annotations, and pre-extracted affordances on Ego4D and EPIC-Kitchens to encourage future research in this area.
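As a rough illustration of how affordances can ground STA predictions, the following hedged sketch (not the paper's actual formulation; all names and the mixing weight are illustrative) boosts the detector's confidence for predictions that fall on likely interaction hotspots.

```python
import torch

def rescore_with_affordance(boxes, scores, hotspot_map, alpha=0.5):
    """Blend detector confidences with an affordance/hotspot prior.

    boxes:       (N, 4) tensor of [x1, y1, x2, y2] pixel coordinates (assumed valid, x2 > x1, y2 > y1)
    scores:      (N,) detector confidences
    hotspot_map: (H, W) hotspot probability map in [0, 1]
    alpha:       illustrative mixing weight
    """
    new_scores = scores.clone()
    H, W = hotspot_map.shape
    for i, (x1, y1, x2, y2) in enumerate(boxes.round().long()):
        x1, x2 = x1.clamp(0, W - 1), x2.clamp(1, W)
        y1, y2 = y1.clamp(0, H - 1), y2.clamp(1, H)
        prior = hotspot_map[y1:y2, x1:x2].mean()  # average hotspot probability inside the box
        new_scores[i] = (1 - alpha) * scores[i] + alpha * scores[i] * prior
    return new_scores
```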
Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs
@inproceedings{quattrocchi2024synchronization, pdf = { https://arxiv.org/pdf/2312.02638.pdf }, year = { 2024 }, booktitle = { European Conference on Computer Vision (ECCV) }, title = { Synchronization is All You Need: Exocentric-to-Egocentric Transfer for Temporal Action Segmentation with Unlabeled Synchronized Video Pairs }, author = { Camillo Quattrocchi and Antonino Furnari and Daniele Di Mauro and Mario Valerio Giuffrida and Giovanni Maria Farinella }, url = {https://github.com/fpv-iplab/synchronization-is-all-you-need} }
We consider the problem of transferring a temporal action segmentation system initially designed for exocentric (fixed) cameras to an egocentric scenario, where wearable cameras capture video data. The conventional supervised approach requires the collection and labeling of a new set of egocentric videos to adapt the model, which is costly and time-consuming. Instead, we propose a novel methodology which performs the adaptation leveraging existing labeled exocentric videos and a new set of unlabeled, synchronized exocentric-egocentric video pairs, for which temporal action segmentation annotations do not need to be collected. We implement the proposed methodology with an approach based on knowledge distillation, which we investigate both at the feature and Temporal Action Segmentation model level. Experiments on Assembly101 and EgoExo4D demonstrate the effectiveness of the proposed method against classic unsupervised domain adaptation and temporal alignment approaches. Without bells and whistles, our best model performs on par with supervised approaches trained on labeled egocentric data, without ever seeing a single egocentric label, achieving a +15.99 improvement in the edit score (28.59 vs 12.60) on the Assembly101 dataset compared to a baseline model trained solely on exocentric data. In similar settings, our method also improves edit score by +3.32 on the challenging EgoExo4D benchmark. Web page
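A minimal sketch of the distillation recipe described above, under the assumption of a frozen exocentric teacher and an egocentric student trained to match its features on synchronized clip pairs (names and the choice of an MSE objective are illustrative, not the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(exo_model: nn.Module,
                      ego_model: nn.Module,
                      exo_clip: torch.Tensor,
                      ego_clip: torch.Tensor,
                      optimizer: torch.optim.Optimizer) -> float:
    """One feature-level distillation step on a synchronized exo-ego clip pair."""
    with torch.no_grad():
        teacher_feat = exo_model(exo_clip)          # features from the fixed camera
    student_feat = ego_model(ego_clip)              # features from the wearable camera
    loss = F.mse_loss(student_feat, teacher_feat)   # align the two representations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the supervision signal is the teacher's feature rather than a label, no temporal action segmentation annotations are needed for the egocentric stream.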
Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection?
@inproceedings{leonardi2024synthetic, pdf = { https://arxiv.org/pdf/2312.02672.pdf }, year = { 2024 }, booktitle = { European Conference on Computer Vision (ECCV) }, title = { Are Synthetic Data Useful for Egocentric Hand-Object Interaction Detection? }, author = { Rosario Leonardi and Antonino Furnari and Francesco Ragusa and Giovanni Maria Farinella }, url = {https://github.com/fpv-iplab/HOI-Synth} }
In this study, we investigate the effectiveness of synthetic data in enhancing egocentric hand-object interaction (HOI) detection. Via extensive experiments and comparative analyses on three egocentric datasets, VISOR, EgoHOS, and ENIGMA-51, our findings reveal how to exploit synthetic data for the HOI detection task when real labeled data are scarce or unavailable. Specifically, by leveraging only 10% of real labeled data, we achieve improvements in Overall AP, compared to baselines trained exclusively on real data, of +5.67% on EPIC-KITCHENS VISOR, +8.24% on EgoHOS, and +11.69% on ENIGMA-51. Our analysis is supported by a novel data generation pipeline and the newly introduced HOI-Synth benchmark which augments existing datasets with synthetic images of hand-object interactions automatically labeled with hand-object contact states, bounding boxes, and pixel-wise segmentation masks. We publicly release the generated data, code, and data generation tools to support future research. Web page
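The low-label regime mentioned above (10% of real labels plus synthetic images) can be emulated with a simple data-mixing utility; the sketch below is illustrative and assumes generic PyTorch datasets, not the HOI-Synth pipeline itself.

```python
import random
from torch.utils.data import ConcatDataset, DataLoader, Subset

def build_mixed_loader(real_dataset, synthetic_dataset,
                       real_fraction=0.10, batch_size=16, seed=0):
    """Keep a fraction of the real labeled data and complement it with synthetic samples."""
    rng = random.Random(seed)
    n_real = max(1, int(len(real_dataset) * real_fraction))
    real_indices = rng.sample(range(len(real_dataset)), n_real)
    mixed = ConcatDataset([Subset(real_dataset, real_indices), synthetic_dataset])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True)
```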
PREGO: online mistake detection in PRocedural EGOcentric videos
@inproceedings{flaborea2024PREGO, year = { 2024 }, booktitle = { Conference on Computer Vision and Pattern Recognition (CVPR) }, title = { PREGO: online mistake detection in PRocedural EGOcentric videos }, author = { Alessandro Flaborea and Guido D'Amely and Leonardo Plini and Luca Scofano and Edoardo De Matteis and Antonino Furnari and Giovanni Maria Farinella and Fabio Galasso }, pdf={https://arxiv.org/pdf/2404.01933.pdf}, url={https://github.com/aleflabo/PREGO?tab=readme-ov-file} }
Promptly identifying procedural errors from egocentric videos in an online setting is highly challenging and valuable for detecting mistakes as soon as they happen. This capability has a wide range of applications across various fields, such as manufacturing and healthcare. The nature of procedural mistakes is open-set since novel types of failures might occur, which calls for one-class classifiers trained on correctly executed procedures. However, no technique can currently detect open-set procedural mistakes online. We propose PREGO, the first online one-class classification model for mistake detection in PRocedural EGOcentric videos. PREGO is based on an online action recognition component to model the current action, and a symbolic reasoning module to predict the next actions. Mistake detection is performed by comparing the recognized current action with the expected future one. We evaluate PREGO on two procedural egocentric video datasets, Assembly101 and Epic-tent, which we adapt for online benchmarking of procedural mistake detection, thus defining the Assembly101-O and Epic-tent-O datasets, respectively. Web Page
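The decision rule can be summarized in a few lines: a mistake is flagged when the action recognized online disagrees with what the reasoning module expected next. The snippet below is a deliberately simplified, hypothetical illustration of that comparison, not the PREGO implementation.

```python
def detect_mistake(recognized_action: str, expected_next_actions: list) -> bool:
    """Flag a mistake when the observed action is not among the expected ones."""
    return recognized_action not in expected_next_actions

# The reasoner expects a screw to be inserted next, but the recognizer
# observes the user attaching a wheel: the step is flagged as a mistake.
print(detect_mistake("attach wheel", ["insert screw", "pick screwdriver"]))  # True
```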
The Ego-Exo4D Dataset
@inproceedings{grauman2023egoexo4d, primaryclass = { cs.CV }, archiveprefix = { arXiv }, eprint = { 2311.18259 }, pdf = { https://arxiv.org/pdf/2311.18259.pdf }, url = { https://ego-exo4d-data.org/ }, year = { 2024 }, booktitle = { Conference on Computer Vision and Pattern Recognition (CVPR) }, title = { Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives }, author = { Kristen Grauman and Andrew Westbury and Lorenzo Torresani and Kris Kitani and Jitendra Malik and Triantafyllos Afouras and Kumar Ashutosh and Vijay Baiyya and Siddhant Bansal and Bikram Boote and Eugene Byrne and Zach Chavis and Joya Chen and Feng Cheng and Fu-Jen Chu and Sean Crane and Avijit Dasgupta and Jing Dong and Maria Escobar and Cristhian Forigua and Abrham Gebreselasie and Sanjay Haresh and Jing Huang and Md Mohaiminul Islam and Suyog Jain and Rawal Khirodkar and Devansh Kukreja and Kevin J Liang and Jia-Wei Liu and Sagnik Majumder and Yongsen Mao and Miguel Martin and Effrosyni Mavroudi and Tushar Nagarajan and Francesco Ragusa and Santhosh Kumar Ramakrishnan and Luigi Seminara and Arjun Somayazulu and Yale Song and Shan Su and Zihui Xue and Edward Zhang and Jinxu Zhang and Angela Castillo and Changan Chen and Xinzhu Fu and Ryosuke Furuta and Cristina Gonzalez and Prince Gupta and Jiabo Hu and Yifei Huang and Yiming Huang and Weslie Khoo and Anush Kumar and Robert Kuo and Sach Lakhavani and Miao Liu and Mi Luo and Zhengyi Luo and Brighid Meredith and Austin Miller and Oluwatumininu Oguntola and Xiaqing Pan and Penny Peng and Shraman Pramanick and Merey Ramazanova and Fiona Ryan and Wei Shan and Kiran Somasundaram and Chenan Song and Audrey Southerland and Masatoshi Tateno and Huiyu Wang and Yuchen Wang and Takuma Yagi and Mingfei Yan and Xitong Yang and Zecheng Yu and Shengxin Cindy Zha and Chen Zhao and Ziwei Zhao and Zhifan Zhu and Jeff Zhuo and Pablo Arbelaez and Gedas Bertasius and David Crandall and Dima Damen and Jakob Engel and Giovanni Maria Farinella and Antonino Furnari and Bernard Ghanem and Judy Hoffman and C. V. Jawahar and Richard Newcombe and Hyun Soo Park and James M. Rehg and Yoichi Sato and Manolis Savva and Jianbo Shi and Mike Zheng Shou and Michael Wray }, }
Ego-Exo4D presents three meticulously synchronized natural language datasets paired with videos: (1) expert commentary, revealing nuanced skills; (2) participant-provided narrate-and-act descriptions in a tutorial style; and (3) one-sentence atomic action descriptions to support browsing, mining the dataset, and addressing challenges in video-language learning. Our goal is to capture simultaneous ego and multiple exo videos, together with multiple egocentric sensing modalities. Our camera configuration features Aria glasses for ego capture, including an 8 MP RGB camera and two SLAM cameras. The ego camera is calibrated and time-synchronized with 4-5 (stationary) GoPros as the exo capture devices. The number and placement of the exocentric cameras is determined per scenario in order to allow maximal coverage of useful viewpoints without obstructing the participants' activity. Apart from multiple views, we also capture multiple modalities. Along with the dataset, we introduce four benchmarks. The recognition benchmark aims to recognize individual keysteps and infer their relation in the execution of procedural activities. The proficiency estimation benchmark aims to estimate the camera wearer's skills. The relation benchmark focuses on methods to establish spatial relationships between synchronized multi-view frames. The pose estimation benchmark concerns the estimation of the camera pose of the camera wearer. Web Page
Action Scene Graphs for Long-Form Understanding of Egocentric Videos
@inproceedings{rodin2023action, primaryclass = { cs.CV }, archiveprefix = { arXiv }, eprint = { 2312.03391 }, pdf = { https://arxiv.org/pdf/2312.03391.pdf }, year = { 2024 }, booktitle = { Conference on Computer Vision and Pattern Recognition (CVPR) }, title = { Action Scene Graphs for Long-Form Understanding of Egocentric Videos }, author = { Ivan Rodin and Antonino Furnari and Kyle Min and Subarna Tripathi and Giovanni Maria Farinella }, url = {https://github.com/fpv-iplab/EASG} }
We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs offering a rich set of annotations designed for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. Web Page
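For intuition, an EASG-style annotation can be thought of as a small, time-stamped graph attached to each action of the camera wearer; the container below is a hypothetical illustration, not the released annotation schema.

```python
from dataclasses import dataclass, field

@dataclass
class EgoActionSceneGraph:
    """Toy container for one temporally localized action scene graph."""
    verb: str                 # action performed by the camera wearer
    timestamp: float          # when the action is observed (seconds)
    objects: list = field(default_factory=list)    # interacted object nouns
    relations: list = field(default_factory=list)  # (subject, predicate, object) triplets

graph_t0 = EgoActionSceneGraph(
    verb="cut",
    timestamp=12.4,
    objects=["knife", "onion"],
    relations=[("camera_wearer", "holds", "knife"), ("knife", "on", "onion")],
)
```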
An Outlook into the Future of Egocentric Vision
@article{Plizzari2024AnOutlook, author = { Chiara Plizzari and Gabriele Goletto and Antonino Furnari and Siddhant Bansal and Francesco Ragusa and Giovanni Maria Farinella and Dima Damen and Tatiana Tommasi }, journal = { International Journal of Computer Vision (IJCV) }, title = { An Outlook into the Future of Egocentric Vision }, year = { 2024 }, url = { https://link.springer.com/article/10.1007/s11263-024-02095-7 }, pdf = { https://link.springer.com/content/pdf/10.1007/s11263-024-02095-7.pdf }, doi = { }, }
In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward facing cameras and digital overlays, is expected to be integrated into our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to the future always-on, personalised and life-enhancing egocentric vision.
StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation
@InProceedings{ragusa2023stillfast, author={Francesco Ragusa and Giovanni Maria Farinella and Antonino Furnari}, title={StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops}, year = {2023}, pdf={https://arxiv.org/pdf/2304.03959.pdf}, url={https://iplab.dmi.unict.it/stillfast/} }
Anticipation problems have been studied considering different aspects such as predicting humans' locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view, proposing a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video, detecting and localizing next-active objects, predicting the verb which describes the future interaction and determining when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method ranked first in the public leaderboard of the EGO4D short term object interaction anticipation challenge 2022. Web Page
EPIC-KITCHENS-100 DATASET
@article{Damen2021rescaling, title = {Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100}, author = {Damen, Dima and Doughty, Hazel and Farinella, Giovanni Maria and Furnari, Antonino and Kazakos, Evangelos and Ma, Jian and Moltisanti, Davide and Munro, Jonathan and Perrett, Toby and Price, Will and Wray, Michael}, journal = {International Journal on Computer Vision (IJCV)}, volume = {130}, pages={33-55}, year = {2022}, url = {http://epic-kitchens.github.io/2020-100}, pdf = {http://arxiv.org/pdf/2006.13256.pdf}, }
We introduce EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments. This extends our previous dataset (EPIC-KITCHENS-55), released in 2018, resulting in more action segments (+128%), environments (+41%) and hours (+84%), using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions (54% more actions per minute). We evaluate the “test of time” - i.e. whether models trained on data collected in 2018 can generalise to new footage collected under the same hypotheses albeit “two years on”. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), detection, anticipation, retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task, provide baselines and evaluation metrics. The dataset was released on 01/07/2020. Please watch the recorded webinar of our presentation for more information. Webinar Web Page
The Ego4D Dataset
@inproceedings{grauman2022around, author = { Kristen Grauman and Andrew Westbury and Eugene Byrne and Zachary Chavis and Antonino Furnari and Rohit Girdhar and Jackson Hamburger and Hao Jiang and Miao Liu and Xingyu Liu and Miguel Martin and Tushar Nagarajan and Ilija Radosavovic and Santhosh Kumar Ramakrishnan and Fiona Ryan and Jayant Sharma and Michael Wray and Mengmeng Xu and Eric Zhongcong Xu and Chen Zhao and Siddhant Bansal and Dhruv Batra and Vincent Cartillier and Sean Crane and Tien Do and Morrie Doulaty and Akshay Erapalli and Christoph Feichtenhofer and Adriano Fragomeni and Qichen Fu and Christian Fuegen and Abrham Gebreselasie and Cristina Gonzalez and James Hillis and Xuhua Huang and Yifei Huang and Wenqi Jia and Weslie Khoo and Jachym Kolar and Satwik Kottur and Anurag Kumar and Federico Landini and Chao Li and Yanghao Li and Zhenqiang Li and Karttikeya Mangalam and Raghava Modhugu and Jonathan Munro and Tullie Murrell and Takumi Nishiyasu and Will Price and Paola Ruiz Puentes and Merey Ramazanova and Leda Sari and Kiran Somasundaram and Audrey Southerland and Yusuke Sugano and Ruijie Tao and Minh Vo and Yuchen Wang and Xindi Wu and Takuma Yagi and Yunyi Zhu and Pablo Arbelaez and David Crandall and Dima Damen and Giovanni Maria Farinella and Bernard Ghanem and Vamsi Krishna Ithapu and C. V. Jawahar and Hanbyul Joo and Kris Kitani and Haizhou Li and Richard Newcombe and Aude Oliva and Hyun Soo Park and James M. Rehg and Yoichi Sato and Jianbo Shi and Mike Zheng Shou and Antonio Torralba and Lorenzo Torresani and Mingfei Yan and Jitendra Malik }, title = { Around the {W}orld in 3,000 {H}ours of {E}gocentric {V}ideo }, booktitle = { IEEE/CVF International Conference on Computer Vision and Pattern Recognition }, year = {2022}, pdf = { https://arxiv.org/pdf/2110.07058.pdf }, url = { https://ego4d-data.org/ }, }
Ego4D is a massive-scale egocentric dataset of unprecedented diversity. It consists of 3,670 hours of video collected by 923 unique participants from 74 worldwide locations in 9 different countries. The project brings together 88 researchers in an international consortium to dramatically increase the scale of egocentric data publicly available by an order of magnitude, making it more than 20x greater than any other dataset in terms of hours of footage. Ego4D aims to catalyse the next era of research in first-person visual perception. The dataset is diverse in its geographic coverage, scenarios, participants and captured modalities. We consulted a survey from the U.S. Bureau of Labor Statistics that captures how people spend the bulk of their time. Data was captured using seven different off-the-shelf head-mounted cameras: GoPro, Vuzix Blade, Pupil Labs, ZShades, ORDRO EP6, iVue Rincon 1080, and Weeview. In addition to video, portions of Ego4D offer other data modalities: 3D scans, audio, gaze, stereo, multiple synchronized wearable cameras, and textual narrations.
The MECCANO Dataset
@inproceedings{ragusa2021meccano, pdf = { https://arxiv.org/pdf/2010.05654.pdf }, url = { https://iplab.dmi.unict.it/MECCANO }, primaryclass = { cs.CV }, booktitle={IEEE Winter Conference on Application of Computer Vision (WACV)}, eprint = { 2010.05654 }, year = {2021}, author = {Francesco Ragusa and Antonino Furnari and Salvatore Livatino and Giovanni Maria Farinella}, title = {The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain} }
@article{ragusa2023meccano, year = {2023}, title = {MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain}, journal = {Computer Vision and Image Understanding (CVIU)}, author = {Francesco Ragusa and Antonino Furnari and Giovanni Maria Farinella}, url = {https://arxiv.org/abs/2209.08691} }
In this work, we introduce MECCANO, the first dataset of egocentric videos to study human-object interactions in industrial-like settings. MECCANO has been acquired by 20 participants who were asked to build a motorbike model, for which they had to interact with tiny objects and tools. The dataset has been explicitly labeled for the task of recognizing human-object interactions from an egocentric perspective. Specifically, each interaction has been labeled both temporally (with action segments) and spatially (with active object bounding boxes). With the proposed dataset, we investigate four different tasks including 1) action recognition, 2) active object detection, 3) active object recognition and 4) egocentric human-object interaction detection, which is a revisited version of the standard human-object interaction detection task. Baseline results show that the MECCANO dataset is a challenging benchmark to study egocentric human-object interactions in industrial-like scenarios. Web Page
Future Predictions From First-Person (Egocentric) Vision
@article{rodin2021predicting, title={Predicting the Future from First Person (Egocentric) Vision: A Survey}, author={Ivan Rodin and Antonino Furnari and Dimitrios Mavroedis and Giovanni Maria Farinella}, year={2021}, volume = {211}, pages = {103252}, issn = {1077-3142}, doi = {https://doi.org/10.1016/j.cviu.2021.103252}, url = {https://www.sciencedirect.com/science/article/pii/S1077314221000965}, journal={Computer Vision and Image Understanding}, pdf={https://arxiv.org/pdf/2107.13411.pdf} }
Egocentric videos can bring a wealth of information about how humans perceive the world and interact with the environment, which can be beneficial for the analysis of human behaviour. The research in egocentric video analysis is developing rapidly thanks to the increasing availability of wearable devices and the opportunities offered by new large-scale egocentric datasets. As computer vision techniques continue to develop at an increasing pace, tasks related to the prediction of the future are starting to evolve from the need to understand the present. Predicting future human activities, trajectories and interactions with objects is crucial in applications such as human-robot interaction, assistive wearable technologies for both industrial and daily living scenarios, entertainment and virtual or augmented reality. This survey summarises the evolution of studies in the context of future prediction from egocentric vision, giving an overview of applications, devices, existing problems, commonly used datasets, models and input modalities. Our analysis highlights that methods for future prediction from egocentric vision can have a significant impact in a range of applications and that further research efforts should be devoted to the standardisation of tasks and the proposal of datasets considering real-world scenarios such as the ones with an industrial vocation.
Streaming Egocentric Action Anticipation
@inproceedings{furnari2022towards, year = {2022}, booktitle = { International Conference on Pattern Recognition (ICPR) }, title = { Towards Streaming Egocentric Action Anticipation }, pdf = { https://arxiv.org/pdf/2110.05386.pdf }, author = { Antonino Furnari and Giovanni Maria Farinella } }
@article{furnari2023streaming, doi = {https://doi.org/10.1016/j.cviu.2023.103763}, pdf = {https://arxiv.org/pdf/2306.16682.pdf}, url = {https://www.sciencedirect.com/science/article/pii/S1077314223001431?via%3Dihub}, year = {2023}, title = {Streaming egocentric action anticipation: an evaluation scheme and approach}, journal = {Computer Vision and Image Understanding (CVIU)}, author = {Antonino Furnari and Giovanni Maria Farinella}, }
Egocentric action anticipation is the task of predicting the future actions a camera wearer will likely perform based on past video observations. While in a real-world system it is fundamental to output such predictions before the action begins, past works have not generally paid attention to model runtime during evaluation. Indeed, current evaluation schemes assume that predictions can be made offline, and hence that computational resources are not limited. In contrast, in this paper, we propose a "streaming" egocentric action anticipation evaluation protocol which explicitly considers model runtime for performance assessment, assuming that predictions will be available only after the current video segment is processed, which depends on the processing time of a method. Following the proposed evaluation scheme, we benchmark different state-of-the-art approaches for egocentric action anticipation on two popular datasets. Our analysis shows that models with a smaller runtime tend to outperform heavier models in the considered streaming scenario, thus changing the rankings generally observed in standard offline evaluations. Based on this observation, we propose a lightweight action anticipation model consisting of a simple feed-forward 3D CNN, which we propose to optimize using knowledge distillation techniques and a custom loss. The results show that the proposed approach outperforms prior art in the streaming scenario, also in combination with other lightweight models. Video presentation at ICPR 2022
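The accounting behind the streaming protocol can be illustrated with a small helper: at the instant an anticipation is needed, only video segments whose processing (observation end plus model runtime) has already finished can contribute, so slower models are forced to rely on older observations. This is a hedged sketch of the idea with illustrative timings, not the official evaluation code.

```python
from typing import List, Optional

def available_prediction_index(segment_end_times: List[float],
                               runtime: float,
                               anticipation_instant: float) -> Optional[int]:
    """Index of the most recent segment whose prediction is ready in time."""
    usable = [i for i, t_end in enumerate(segment_end_times)
              if t_end + runtime <= anticipation_instant]
    return usable[-1] if usable else None

# Segments end at t = 27, 28, 29 s; a prediction is required at t = 29.5 s.
# A slow model (1.0 s/segment) can only rely on the segment ending at 28 s,
# while a fast one (0.3 s/segment) can use the segment ending at 29 s.
print(available_prediction_index([27.0, 28.0, 29.0], runtime=1.0, anticipation_instant=29.5))  # 1
print(available_prediction_index([27.0, 28.0, 29.0], runtime=0.3, anticipation_instant=29.5))  # 2
```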
EPIC-KITCHENS-55 DATASET
@article{damen2020epic, author = {Dima Damen and Hazel Doughty and Giovanni Maria Farinella and Sanja Fidler and Antonino Furnari and Evangelos Kazakos and Davide Moltisanti and Jonathan Munro and Toby Perrett and Will Price and Michael Wray}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)}, title = {The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines}, url = {https://epic-kitchens.github.io/}, pdf = {https://arxiv.org/pdf/2005.00343.pdf}, year = {2020}, doi = {10.1109/TPAMI.2020.2991965} }
@inproceedings{Damen2018EPICKITCHENS, year = {2018}, booktitle= { European Conference on Computer Vision }, author = { D. Damen and H. Doughty and G. M. Farinella and S. Fidler and A. Furnari and E. Kazakos and D. Moltisanti and J. Munro and T. Perrett and W. Price and M. Wray }, title = { Scaling Egocentric Vision: The EPIC-KITCHENS Dataset }, url={https://epic-kitchens.github.io/2018}, pdf={https://arxiv.org/pdf/1804.02748.pdf} }
We introduced EPIC-KITCHENS-55, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labeled for a total of 39.6K action segments and 454.2K object bounding boxes. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. This work is a joint collaboration between the University of Catania, the University of Bristol and the University of Toronto. Web Page
Egocentric Visitors Behavior Understanding in Cultural Sites
@article{ragusa2020egoch, title = {EGO-CH: Dataset and Fundamental Tasks for Visitors Behavioral Understanding using Egocentric Vision}, journal = {Pattern Recognition Letters - Special Issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage}, year = {2020}, pdf = {https://www.antoninofurnari.it/downloads/publications/ragusa2020egoch.pdf}, url = {https://iplab.dmi.unict.it/EGO-CH/}, author = {F. Ragusa and A. Furnari and S. Battiato and G. Signorello and G. M. Farinella}, }
@article{orlando2020egocentric, author = {S. Orlando and A. Furnari and G. M. Farinella}, url = {https://iplab.dmi.unict.it/SimulatedEgocentricNavigations/}, pdf = {https://www.antoninofurnari.it/downloads/publications/orlando2020egocentric.pdf}, year = {2020}, journal = {Pattern Recognition Letters - Special Issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage}, title = {Egocentric Visitor Localization and Artwork Detection inCultural Sites Using Synthetic Data}, }
@article{milotta2019egocentric, pdf = {https://www.antoninofurnari.it/downloads/publications/milotta2019egocentric.pdf}, author = {Filippo L.M. Milotta and Antonino Furnari and Sebastiano Battiato and Giovanni Signorello and Giovanni M. Farinella}, url = {https://iplab.dmi.unict.it/EgoNature/}, doi = {https://doi.org/10.1016/j.jvcir.2019.102664}, issn = {1047-3203}, year = {2019}, pages = {102664}, journal = {Journal of Visual Communication and Image Representation}, title = {Egocentric Visitors Localization in Natural Sites}, }
@article{ragusa2019egocentric, author = {F. Ragusa and A. Furnari and S. Battiato and G. Signorello and G. M. Farinella}, url = {http://iplab.dmi.unict.it/VEDI/}, pdf = {https://arxiv.org/pdf/1904.05264.pdf}, year = {2019}, journal = {Journal on Computing and Cultural Heritage (JOCCH)}, title = {Egocentric Visitors Localization in Cultural Sites}, volume = {12}, issue = {2}, doi = {https://doi.org/10.1145/3276772} }
We consider the problem of localizing visitors in a cultural site from egocentric (first person) images. Localization information can be useful both to assist the user during his visit (e.g., by suggesting where to go and what to see next) and to provide behavioral information to the manager of the cultural site (e.g., how much time has been spent by visitors at a given location? What has been liked most?). To tackle the problem, we collected a large dataset of egocentric videos using two cameras: a head-mounted HoloLens device and a chest-mounted GoPro. Each frame has been labeled according to the location of the visitor and to what he was looking at. The dataset is freely available in order to encourage research in this domain. The dataset is complemented with baseline experiments performed considering a state-of-the-art method for location-based temporal segmentation of egocentric videos. Experiments show that compelling results can be achieved to extract useful information for both the visitor and the site manager. Web Page
Rolling-Unrolling LSTMs for Egocentric Action Anticipation
@article{furnari2020rulstm, author = {Antonino Furnari and Giovanni Maria Farinella}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)}, title = {Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video}, url = {https://iplab.dmi.unict.it/rulstm}, pdf = {https://arxiv.org/pdf/2005.02190.pdf}, year = {2020}, doi = {10.1109/TPAMI.2020.2992889} }
@inproceedings{furnari2019rulstm, title = { What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention }, author = { Antonino Furnari and Giovanni Maria Farinella }, year = { 2019 }, booktitle = { International Conference on Computer Vision }, pdf = {https://arxiv.org/pdf/1905.09035.pdf}, url = {http://iplab.dmi.unict.it/rulstm} }
Egocentric action anticipation consists of understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past, and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on three large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-Kitchens dataset including more than 2500 actions, and generalizes to EGTEA Gaze+ and Activitynet. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. Our method was ranked first in the public leaderboard of the EPIC-Kitchens egocentric action anticipation challenge 2019. Web Page - Code.
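The adaptive fusion can be sketched as a tiny module that predicts one weight per modality from the concatenated modality features and uses those weights to mix the per-modality predictions. Layer sizes, shapes and names below are illustrative; this is a simplified sketch of the fusion idea, not the released RU-LSTM code.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Simplified MATT-style fusion of RGB, flow and object predictions."""
    def __init__(self, feat_dim: int, num_modalities: int = 3):
        super().__init__()
        self.score_net = nn.Sequential(
            nn.Linear(feat_dim * num_modalities, 128),
            nn.ReLU(),
            nn.Linear(128, num_modalities),
        )

    def forward(self, features, predictions):
        # features:    per-modality feature vectors, each (B, feat_dim)
        # predictions: per-modality class scores, each (B, num_classes)
        weights = torch.softmax(self.score_net(torch.cat(features, dim=1)), dim=1)  # (B, M)
        stacked = torch.stack(predictions, dim=1)                                   # (B, M, C)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)                         # (B, C)

fusion = ModalityAttentionFusion(feat_dim=64)
feats = [torch.randn(2, 64) for _ in range(3)]
preds = [torch.randn(2, 10) for _ in range(3)]
fused = fusion(feats, preds)  # (2, num_classes) fused anticipation scores
```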
Verb-Noun Marginal Cross Entropy Loss for Egocentric Action Anticipation
@inproceedings{furnari2018Leveraging, author = { A. Furnari and S. Battiato and G. M. Farinella }, title = { Leveraging Uncertainty to Rethink Loss Functions and Evaluation Measures for Egocentric Action Anticipation }, booktitle = { International Workshop on Egocentric Perception, Interaction and Computing (EPIC) in conjunction with ECCV }, pdf = { ../publications/furnari2018Leveraging.pdf }, url = {https://github.com/fpv-iplab/action-anticipation-losses/}, year = { 2018 }, }
Current action anticipation approaches often neglect the intrinsic uncertainty of future predictions when loss functions or evaluation measures are designed. The uncertainty of future observations is especially relevant in the context of egocentric visual data, which is naturally exposed to a great deal of variability. Considering the problem of egocentric action anticipation, we investigate how loss functions and evaluation measures can be designed to explicitly take into account the natural multi-modality of future events. In particular, we discuss suitable measures to evaluate egocentric action anticipation and study how loss functions can be defined to incorporate the uncertainty arising from the prediction of future events. Experiments performed on the EPIC-KITCHENS dataset show that the proposed loss function allows improving the results of both egocentric action anticipation and recognition methods. Code
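The gist of the loss can be illustrated as follows: action scores over all (verb, noun) pairs are converted into a joint probability table, marginalized into a verb distribution and a noun distribution, and each marginal is penalized with a negative log-likelihood. The snippet is a hedged sketch with illustrative shapes, not the exact released implementation.

```python
import torch
import torch.nn.functional as F

def verb_noun_marginal_loss(action_logits, verb_targets, noun_targets, num_verbs, num_nouns):
    """Cross-entropy on the verb and noun marginals of the action distribution."""
    # (B, V*N) logits -> (B, V, N) probability table over verb-noun pairs
    p = F.softmax(action_logits, dim=1).view(-1, num_verbs, num_nouns)
    p_verb = p.sum(dim=2)   # marginal over nouns -> (B, V)
    p_noun = p.sum(dim=1)   # marginal over verbs -> (B, N)
    eps = 1e-8
    return (F.nll_loss(torch.log(p_verb + eps), verb_targets) +
            F.nll_loss(torch.log(p_noun + eps), noun_targets))

logits = torch.randn(4, 5 * 7)            # 4 samples, 5 verbs x 7 nouns
verbs = torch.randint(0, 5, (4,))
nouns = torch.randint(0, 7, (4,))
print(verb_noun_marginal_loss(logits, verbs, nouns, num_verbs=5, num_nouns=7))
```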
Next-Active-Object-Prediction from Egocentric Video
@article{furnari2017next, title = { Next-active-object prediction from egocentric videos }, journal = { Journal of Visual Communication and Image Representation }, volume = { 49 }, number = { Supplement C }, pages = { 401 - 411 }, year = { 2017 }, issn = { 1047-3203 }, doi = { https://doi.org/10.1016/j.jvcir.2017.10.004 }, url = { http://iplab.dmi.unict.it/NextActiveObjectPrediction/ }, pdf = {https://www.antoninofurnari.it/downloads/publications/furnari2017next.pdf}, author = { Antonino Furnari and Sebastiano Battiato and Kristen Grauman and Giovanni Maria Farinella }, }
Although First Person Vision systems can sense the environment from the user's perspective, they are generally unable to predict the user's intentions and goals. Since human activities can be decomposed in terms of atomic actions and interactions with objects, intelligent wearable systems would benefit from the ability to anticipate user-object interactions. Even if this task is not trivial, the First Person Vision paradigm can provide important cues to address this challenge. Specifically, we propose to exploit the dynamics of the scene to recognize next-active-objects before an object interaction actually begins. We train a classifier to discriminate trajectories leading to an object activation from all others, and perform next-active-object prediction by analyzing fixed-length trajectory segments within a sliding window. We investigate what properties of egocentric object motion are most discriminative for the task and evaluate the temporal support with respect to which such motion should be considered. The proposed method compares favorably with respect to several baselines on the ADL egocentric dataset, which has been acquired by 20 subjects and contains 10 hours of video of unconstrained interactions with several objects. Web Page
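A toy version of the sliding-window pipeline, under the assumption of simple hand-crafted motion descriptors and an off-the-shelf classifier (both illustrative; the features and classifier differ from those studied in the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def segment_features(track):
    """track: (T, 2) array of object centre positions -> small motion descriptor."""
    step = np.diff(track, axis=0)
    return np.concatenate([step.mean(axis=0),      # average motion direction
                           step.std(axis=0),       # motion variability
                           track[-1] - track[0]])  # net displacement

def sliding_window_scores(track, clf, window=8):
    """Score each fixed-length trajectory segment as 'leading to an activation'."""
    return [clf.predict_proba(segment_features(track[s:s + window]).reshape(1, -1))[0, 1]
            for s in range(len(track) - window + 1)]

# Toy data: 'positive' tracks drift consistently (object being approached), negatives jitter in place.
rng = np.random.default_rng(0)
pos = [np.cumsum(rng.normal(0.5, 0.1, (8, 2)), axis=0) for _ in range(20)]
neg = [np.cumsum(rng.normal(0.0, 0.1, (8, 2)), axis=0) for _ in range(20)]
X = np.stack([segment_features(t) for t in pos + neg])
y = np.array([1] * 20 + [0] * 20)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

test_track = np.cumsum(rng.normal(0.5, 0.1, (16, 2)), axis=0)
print(sliding_window_scores(test_track, clf))  # per-window next-active-object scores
```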
Location-Based Temporal Segmentation of Egocentric Videos
@article{furnari2018personal, pages = { 1-12 }, volume = { 52 }, doi = { https://doi.org/10.1016/j.jvcir.2018.01.019 }, issn = { 1047-3203 }, author = { Antonino Furnari and Sebastiano Battiato and Giovanni Maria Farinella }, url = { http://iplab.dmi.unict.it/PersonalLocationSegmentation/ }, pdf = { ../publications/furnari2018personal.pdf }, year = { 2018 }, journal = { Journal of Visual Communication and Image Representation }, title = { Personal-Location-Based Temporal Segmentation of Egocentric Video for Lifelogging Applications }, }
@inproceedings{furnari2016temporal, url = { http://iplab.dmi.unict.it/PersonalLocationSegmentation/ }, pdf = { ../publications/furnari2016temporal.pdf }, year = { 2016 }, publisher = { Springer Lecture Notes in Computer Science }, series = { Lecture Notes in Computer Science }, volume = { 9913 }, pages = { 474--489 }, booktitle = { International Workshop on Egocentric Perception, Interaction and Computing (EPIC) in conjunction with ECCV, The Netherlands, Amsterdam, October 9 }, title = { Temporal Segmentation of Egocentric Videos to Highlight Personal Locations of Interest }, author = { Antonino Furnari and Giovanni Maria Farinella and Sebastiano Battiato }, }
Temporal video segmentation can be useful to improve the exploitation of long egocentric videos. Previous work has focused on general purpose methods designed to work on data acquired by different users. In contrast, egocentric data tends to be very personal and meaningful for the user who acquires it. In particular, being able to extract information related to personal locations can be very useful for life-logging related applications such as indexing long egocentric videos, detecting semantically meaningful video segments for later retrieval or summarization, and estimating the amount of time spent at a given location. In this paper, we propose a method to segment egocentric videos on the basis of the locations visited by the user. The method is aimed at providing a personalized output and hence allows the user to specify which locations they want to keep track of. To account for negative locations (i.e., locations not specified by the user), we propose an effective negative rejection method which leverages the continuous nature of egocentric videos and does not require any negative samples at training time. To perform the experimental analysis, we collected a dataset of egocentric videos containing 10 personal locations of interest. Results show that the method is accurate and compares favorably with the state of the art. Web Page
Recognizing Personal Locations from Egocentric Videos
@article{furnari2016recognizing, author={Furnari, Antonino and Farinella, Giovanni Maria and Battiato, Sebastiano}, journal={IEEE Transactions on Human-Machine Systems}, title={Recognizing Personal Locations From Egocentric Videos}, year={2016}, doi={10.1109/THMS.2016.2612002}, ISSN={2168-2291}, url={http://iplab.dmi.unict.it/PersonalLocations/}, pdf={../publications/furnari2016recognizing.pdf} }
@inproceedings{furnari2015recognizing, url = { http://iplab.dmi.unict.it/PersonalLocations/ }, pdf = { ../publications/furnari2015recognizing.pdf }, year = { 2015 }, booktitle = { Workshop on Assistive Computer Vision and Robotics (ACVR) in conjunction with ICCV, Santiago, Chile, December 12 }, page = { 393--401 }, title = { Recognizing Personal Contexts from Egocentric Images }, author = { Antonino Furnari and Giovanni Maria Farinella and Sebastiano Battiato }, }
Contextual awareness in wearable computing allows for the construction of intelligent systems which are able to interact with the user in a more natural way. In this paper, we study how personal locations arising from the user's daily activities can be recognized from egocentric videos. We assume that few training samples are available for learning purposes. Considering the diversity of the devices available on the market, we introduce a benchmark dataset containing egocentric videos of 8 personal locations acquired by a user with 4 different wearable cameras. To make our analysis useful in real-world scenarios, we propose a method to reject negative locations, i.e., those not belonging to any of the categories of interest for the end-user. We assess the performance of the main state-of-the-art representations for scene and object classification on the considered task, as well as the influence of device-specific factors such as the Field of View (FOV) and the wearing modality. Concerning the different device-specific factors, experiments revealed that the best results are obtained using a head-mounted, wide-angle device. Our analysis shows the effectiveness of using representations based on Convolutional Neural Networks (CNN), employing basic transfer learning techniques and an entropy-based rejection algorithm. Web Page
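The entropy-based rejection mentioned above can be sketched in a few lines: frames whose class posterior is too uncertain are assigned to the negative class instead of one of the personal locations. The threshold value and interface below are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def entropy_rejection(logits, threshold):
    """Return predicted location ids, or -1 for frames rejected as 'negative'."""
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # per-frame uncertainty
    labels = probs.argmax(dim=1)
    labels[entropy > threshold] = -1                         # reject uncertain frames
    return labels

scores = torch.tensor([[4.0, 0.1, 0.1],    # confident -> kept
                       [1.0, 1.0, 1.1]])   # ambiguous -> rejected
print(entropy_rejection(scores, threshold=0.8))  # tensor([ 0, -1])
```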