MiniGPT-4 is another example of open-source AI on the rise


OpenAI introduced GPT-4 as a multimodal model with image understanding, but has not yet released the image part of the model. MiniGPT-4 makes it available today – as an open-source model.

MiniGPT-4 is a chatbot with image understanding. This is a feature OpenAI introduced at the launch of GPT-4 but has not yet released outside the Be My Eyes app.

Like its larger counterpart, MiniGPT-4 can describe images or answer questions about the content of an image: for example, given a picture of a prepared dish, the model can output a (possibly) matching recipe (see featured image) or generate an appropriate image description for visually impaired people. Similar to Midjourney’s new “/describe” feature, MiniGPT-4 could extract prompts from images, or at least some ideas. OpenAI’s much-touted image-to-website feature, introduced at the GPT-4 launch, can also be done with MiniGPT-4, according to the researchers.

MiniGPT-4 generates matching HTML code based on a hand-drawn web page sketch. | Image: Zhu, Chen et al.

“Our findings reveal that MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4 like detailed image description generation and website creation from hand-written drafts,” the paper states.


The development team makes the code, demos, and training instructions for MiniGPT-4 available on GitHub. They have also announced a smaller version of the model that will run on a single Nvidia RTX 3090 graphics card. The demo video below shows some examples.

Open-source AI is on the rise

The remarkable thing about MiniGPT-4 is that it is based on the Vicuna-13B LLM and the BLIP-2 vision-language model – open-source software that can be trained and fine-tuned for comparatively little money and without massive data and computational overhead.
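The core trick described in the MiniGPT-4 paper is to keep both the BLIP-2 visual encoder and Vicuna frozen, and to train only a single projection layer that maps visual features into the LLM's input embedding space. The sketch below illustrates that idea; the dimensions and token count are illustrative assumptions, not the exact model configuration.

```python
import numpy as np

# Illustrative sketch of MiniGPT-4's alignment idea: one trainable
# linear layer projects frozen BLIP-2 Q-Former visual features into
# the input embedding space of the frozen Vicuna LLM.
QFORMER_DIM = 768      # assumed Q-Former output width
LLM_EMBED_DIM = 5120   # assumed Vicuna-13B embedding width
NUM_QUERY_TOKENS = 32  # assumed number of visual query tokens

rng = np.random.default_rng(0)
W = rng.standard_normal((QFORMER_DIM, LLM_EMBED_DIM)) * 0.02  # trainable weights
b = np.zeros(LLM_EMBED_DIM)                                   # trainable bias

def project_visual_features(qformer_out: np.ndarray) -> np.ndarray:
    """Map (num_tokens, QFORMER_DIM) visual features to LLM embedding vectors."""
    return qformer_out @ W + b

visual_features = rng.standard_normal((NUM_QUERY_TOKENS, QFORMER_DIM))
llm_tokens = project_visual_features(visual_features)
print(llm_tokens.shape)  # one embedding row per visual token
```

Because only `W` and `b` are trained while both large pretrained models stay frozen, the number of trainable parameters is tiny, which is what makes the training budget described below plausible.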

The research team first trained MiniGPT-4 with about five million image-text pairs in ten hours on four Nvidia A100 cards. In a second step, the model was refined with 3,500 high-quality text-image pairs generated by an interaction between MiniGPT-4 and ChatGPT. ChatGPT corrected the incorrect or inaccurate image descriptions generated by MiniGPT-4.

Fix the error in the given paragraph. Remove any repeating sentences, meaningless characters, not English sentences, and so on. Remove unnecessary repetition. Rewrite any incomplete sentences. Return directly the results without explanation. Return directly the input paragraph if it is already correct without explanation.

ChatGPT prompt for MiniGPT-4
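The second-stage cleaning step can be pictured as wrapping each raw MiniGPT-4 caption in the correction prompt quoted above and sending it to ChatGPT. The sketch below uses the common chat-completion message convention; the function name and message structure are illustrative assumptions, and the actual API call is omitted.

```python
# Hypothetical sketch of the data-cleaning loop: each raw image
# description generated by MiniGPT-4 is paired with the correction
# prompt and formatted as a chat request for ChatGPT.
CORRECTION_PROMPT = (
    "Fix the error in the given paragraph. Remove any repeating sentences, "
    "meaningless characters, not English sentences, and so on. Remove "
    "unnecessary repetition. Rewrite any incomplete sentences. Return "
    "directly the results without explanation. Return directly the input "
    "paragraph if it is already correct without explanation."
)

def build_correction_messages(raw_description: str) -> list:
    """Build a chat-style request asking ChatGPT to clean one caption."""
    return [
        {"role": "system", "content": CORRECTION_PROMPT},
        {"role": "user", "content": raw_description},
    ]

msgs = build_correction_messages("a plate of of food with with pasta on it")
print(msgs[1]["content"])
```

Repeating this over the candidate captions yields the 3,500 high-quality text-image pairs used for fine-tuning.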

This second step significantly improved the reliability and usability of the model – and required only seven minutes of training on a single Nvidia A100. The researchers themselves said they were surprised by the efficiency of their approach.

MiniGPT-4’s language model, Vicuna, follows the “Alpaca formula” and uses ChatGPT’s output to fine-tune a Meta language model of the LLaMA family. Vicuna is said to approach the quality of Google Bard and ChatGPT, again with a relatively small training effort.

