Slightly more than 10 months ago OpenAI's ChatGPT was first released to the public. Its arrival ushered in an era of nonstop headlines about artificial intelligence and accelerated the development of competing large language models (LLMs) from Google, Meta and other tech giants. Since that time, these chatbots have demonstrated an impressive capacity for generating text and code, albeit not always accurately. And now multimodal AIs that are capable of parsing not only text but also images, audio, and more are on the rise.

OpenAI released a multimodal version of ChatGPT, powered by its LLM GPT-4, to paying subscribers for the first time last week, months after the company first announced these capabilities. Google began incorporating similar image and audio features to those offered by the new GPT-4 into some versions of its LLM-powered chatbot, Bard, back in May. Meta, too, announced big strides in multimodality this past spring. Though it is in its infancy, the burgeoning technology can perform a variety of tasks.

What Can Multimodal AI Do?

Scientific American tested out two different chatbots that rely on multimodal LLMs: a version of ChatGPT powered by the updated GPT-4 (dubbed GPT-4 with vision, or GPT-4V) and Bard, which is currently powered by Google's PaLM 2 model. Both can hold hands-free vocal conversations using only audio, and they can describe scenes within images and decipher lines of text in a picture.

These abilities have myriad applications. In our test, using only a photograph of a receipt and a two-line prompt, ChatGPT accurately split a complicated bar tab and calculated the amount owed for each of four different people, including tip and tax. Altogether, the task took less than 30 seconds. Bard did nearly as well, but it interpreted one "9" as a "0," thus flubbing the final total.
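For readers curious about the arithmetic involved, here is a minimal Python sketch of that kind of proportional split, using hypothetical receipt amounts (the actual figures from the test are not published here); the chatbot arrives at equivalent results from the photographed receipt alone.

```python
# Split a bar tab among diners: each person pays for their own items
# plus a proportional share of the tax and tip.
# All amounts below are hypothetical, for illustration only.

# Hypothetical per-person item subtotals from the receipt
items = {
    "Person A": 24.00,
    "Person B": 18.50,
    "Person C": 31.25,
    "Person D": 15.75,
}

tax = 7.92        # hypothetical sales tax printed on the receipt
tip_rate = 0.20   # hypothetical 20 percent tip on the pre-tax subtotal

subtotal = sum(items.values())
tip = subtotal * tip_rate

for person, amount in items.items():
    share = amount / subtotal                   # fraction of the pre-tax subtotal
    owed = amount + share * (tax + tip)         # own items plus proportional tax and tip
    print(f"{person}: ${owed:.2f}")

print(f"Total collected: ${subtotal + tax + tip:.2f}")
```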
In another trial, when given a photograph of a stocked bookshelf, both chatbots offered detailed descriptions of the hypothetical owner's supposed character and interests that were almost like AI-generated horoscopes. Both identified the Statue of Liberty from a single photograph, deduced that the image was snapped from an office in lower Manhattan and offered spot-on directions from the photographer's original location to the landmark (though ChatGPT's guidance was more detailed than Bard's). And ChatGPT also outperformed Bard in accurately identifying insects from photographs.

Based on this photograph of a potted plant, two multimodal AI-powered chatbots, OpenAI's ChatGPT (a version powered by GPT-4V) and Google's Bard, accurately estimated the size of the container. Credit: Lauren Leffer

For disabled communities, the applications of such tech are particularly exciting. In March OpenAI started testing its multimodal version of GPT-4 through the company Be My Eyes, which provides a free description service through an app of the same name for blind and low-sighted people. The early trials went well enough that Be My Eyes is now in the process of rolling out the AI-powered version of its app to all its users. "We are getting such exceptional feedback," says Jesper Hvirring Henriksen, chief technology officer of Be My Eyes. At first there were lots of obvious issues, such as poorly transcribed text or inaccurate descriptions containing AI hallucinations.