S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici et al., YouTube-8M: A large-scale video classification benchmark, 2016.

M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, Sequential deep learning for human action recognition, 2011.
DOI : 10.1007/978-3-642-25446-8_4

URL : https://hal.archives-ouvertes.fr/hal-01354493

F. Baradel, C. Wolf, J. Mille, and G. Taylor, Glimpse clouds: Human activity recognition from unstructured feature points, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01713109

P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, and K. Kavukcuoglu, Interaction networks for learning about objects, relations and physics, 2016.

B. Zhou, A. Andonian, A. Oliva, and A. Torralba, Temporal relational reasoning in videos, 2018.

J. Carreira and A. Zisserman, Quo vadis, action recognition? A new model and the Kinetics dataset, 2017.

D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari et al., Scaling egocentric vision: The EPIC-KITCHENS dataset, 2018.

J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan et al., Long-term recurrent convolutional networks for visual recognition and description, 2015.
DOI : 10.21236/ada623249

URL : http://www.dtic.mil/dtic/tr/fulltext/u2/a623249.pdf

B. Duke, Lintel: Python video decoding, 2018.

F. Fleuret, T. Li, C. Dubout, E. K. Wampler, S. Yantis et al., Comparing machines and humans on a visual categorization test, Proceedings of the National Academy of Sciences of the United States of America, vol.108, pp.17621-17626, 2011.
DOI : 10.1073/pnas.1109168108

URL : http://www.pnas.org/content/108/43/17621.full.pdf

D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik, From lifestyle vlogs to everyday interactions, 2018.

R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal et al., The "something something" video database for learning and evaluating visual common sense, 2017.
DOI : 10.1109/iccv.2017.622

URL : http://arxiv.org/pdf/1706.04261

C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru et al., AVA: A video dataset of spatio-temporally localized atomic visual actions, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01764300

K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, 2017.
DOI : 10.1109/tpami.2018.2844175

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, 2016.
DOI : 10.1109/cvpr.2016.90

URL : http://arxiv.org/pdf/1512.03385

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

D. Hudson and C. Manning, Compositional attention networks for machine reasoning, 2018.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar et al., Large-scale video classification with convolutional neural networks, 2014.
DOI : 10.1109/cvpr.2014.223

URL : http://www.cs.cmu.edu/~rahuls/pub/cvpr2014-deepvideo-rahuls.pdf

J. Kim, M. Ricci, and T. Serre, Not-so-CLEVR: Visual relations strain feedforward neural networks, 2018.
DOI : 10.1098/rsfs.2018.0011

D. Kingma and J. Ba, Adam: A method for stochastic optimization, ICLR, 2015.

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International Journal of Computer Vision (IJCV), vol.123, pp.32-73, 2017.
DOI : 10.1007/s11263-016-0981-7

URL : https://link.springer.com/content/pdf/10.1007%2Fs11263-016-0981-7.pdf

P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun, Predicting deeper into the future of semantic segmentation, 2017.
DOI : 10.1109/iccv.2017.77

URL : https://hal.archives-ouvertes.fr/hal-01494296

D. Luvizon, D. Picard, and H. Tabia, 2D/3D pose estimation and action recognition using multitask deep learning, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01815703

M. Monfort, B. Zhou, S. A. Bargal, A. Andonian, T. Yan et al., Moments in time dataset: one million videos for event understanding, 2018.

E. Perez, H. D. Vries, F. Strub, V. Dumoulin, and A. Courville, Learning visual reasoning without strong priors, ICML Machine Learning in Speech and Language Processing Workshop, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01648684

L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang et al., Seeing the arrow of time, 2014.

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, 2015.
DOI : 10.1109/tpami.2016.2577031

URL : http://arxiv.org/pdf/1506.01497

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh et al., ImageNet large scale visual recognition challenge, IJCV, vol.115, issue.3, pp.211-252, 2015.
DOI : 10.1007/s11263-015-0816-y

URL : http://dspace.mit.edu/bitstream/1721.1/104944/1/11263_2015_Article_816.pdf

A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu et al., A simple neural network module for relational reasoning, 2017.

A. Shahroudy, J. Liu, T. T. Ng, and G. Wang, NTU RGB+D: A large scale dataset for 3D human activity analysis, 2016.
DOI : 10.1109/cvpr.2016.115

URL : http://arxiv.org/pdf/1604.02808

S. Sharma, R. Kiros, and R. Salakhutdinov, Action recognition using visual attention, ICLR Workshop, 2016.

K. Simonyan and A. Zisserman, Two-stream convolutional networks for action recognition in videos, 2014.

S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, An end-to-end spatio-temporal attention model for human action recognition from skeleton data, 2016.

S. Stabinger, A. Rodríguez-Sánchez, and J. Piater, 25 years of CNNs: Can we compare to human abstraction capabilities?, 2016.

S. van Steenkiste, M. Chang, K. Greff, and J. Schmidhuber, Relational neural expectation maximization: Unsupervised discovery of objects and their interactions, 2018.

L. Sun, K. Jia, K. Chen, D. Yeung, B. E. Shi et al., Lattice long short-term memory for human action recognition, 2017.
DOI : 10.1109/iccv.2017.236

URL : http://arxiv.org/pdf/1708.03958

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning spatiotemporal features with 3D convolutional networks, 2015.
DOI : 10.1109/iccv.2015.510

URL : http://arxiv.org/pdf/1412.0767

P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò et al., Graph attention networks, 2018.

H. Wang, A. Kläser, C. Schmid, and C.-L. Liu, Action recognition by dense trajectories, 2011.
DOI : 10.1109/cvpr.2011.5995407

URL : https://hal.archives-ouvertes.fr/inria-00583818

N. Watters, D. Zoran, T. Weber, P. Battaglia, R. Pascanu et al., Visual interaction networks: Learning a physics simulator from video, 2017.

S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, Rethinking spatiotemporal feature learning for video understanding, 2017.