Main Article Content
Automatically interpreting visuals is one of the challenges that has plagued Artificial Intelligence (AI). It connects the two domains of computer vision and natural language understanding. We employ recent advances in neural networks, such as CNN and RNN, to deliver the finest captions in this research. The model that is single end to end to predict the caption given a photo which unifies the two architecture to create the text utilizing the features. Two forms of discriminator architectures (CNN and LSTM-based structures) are introduced, each with its unique set of benefits. The variety of inscriptions created was forced to a breaking point by these approaches. There should be no assumptions about explicit preconditions in the model. Instead of relying on predetermined forms, standards, or classes, you must figure out how to construct sentences from the preparatory data. The accuracy of the model is proved by comparing it to numerous datasets. Many evaluation indications show that our model is highly accurate. Our model is validated using the benchmark datasets Flickr8K and Flickr30K. One of the approaches used to evaluate is BLEU scores.