Answering a question about a given image is a well-known vision-language task in which a machine is given an image–question pair and must generate a natural-language answer. Humans can easily relate image content to a given question and reason about how to answer it, but automating this task is challenging because it involves many computer vision and NLP subtasks. Most of the literature focuses on novel attention mechanisms for fusing image and question features while ignoring the importance of improving the question feature extraction module. Transformers have changed the way spatial and temporal data are processed. This paper exploits the power of Bidirectional Encoder Representations from Transformers (BERT) as a question feature extractor for the VQA model. A novel method of extracting question features by combining the output features of four consecutive BERT encoder layers is proposed, motivated by the fact that successive transformer encoder layers attend to progressively larger linguistic units, from words to phrases and ultimately to a sentence-level representation. Building on this, a novel BERT-based hierarchical alternating co-attention VQA model using bottom-up attention features is proposed. The model is evaluated on the publicly available benchmark dataset VQA v2.0, and experimental results show that it improves upon two baseline models by 9.37% and 0.74%, respectively.
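The core idea of combining the outputs of four consecutive encoder layers into a single question feature can be sketched as follows. This is a minimal illustration, not the paper's implementation: BERT is stood in for by a small stack of generic Transformer encoder layers, and both the choice of layers (the last four) and the combination method (element-wise summation followed by mean-pooling over tokens) are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

# Assumption: a small 12-layer encoder stack stands in for BERT here.
NUM_LAYERS, HIDDEN, HEADS = 12, 64, 4

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=HEADS,
                               dim_feedforward=128, batch_first=True)
    for _ in range(NUM_LAYERS)
)

def question_feature(token_embeddings: torch.Tensor) -> torch.Tensor:
    """Run all encoder layers, keep each layer's output, and fuse the
    outputs of four consecutive layers into one question feature."""
    hidden = token_embeddings
    per_layer = []
    for layer in layers:
        hidden = layer(hidden)
        per_layer.append(hidden)
    # Combine four consecutive encoder outputs (assumption: the final
    # four layers, fused by element-wise sum), then mean-pool over tokens.
    fused = torch.stack(per_layer[-4:], dim=0).sum(dim=0)
    return fused.mean(dim=1)  # shape: (batch, HIDDEN)

# Example: a batch of 2 questions, each 10 tokens long.
tokens = torch.randn(2, 10, HIDDEN)
q_feat = question_feature(tokens)
```

In a real VQA pipeline, this fused question feature would then be passed, together with the bottom-up image features, into the co-attention module; summation of layer outputs is only one plausible fusion choice (concatenation or weighted pooling are common alternatives).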