MLLMs, VLMs, LVLMs, LMMs...

There exists a class of models whose inputs are a text prompt plus images or video, and whose outputs are text. Example prompt: “Explain the joke in this tweet. Be concise.” Answer, courtesy of GPT-4o: The joke humorously compares “the talk” about sensitive topics with explaining to kids why there’s a server at home. The mock children’s book title exaggerates the idea, poking fun at tech enthusiasts whose home servers are significant enough to require a formal explanation to their kids. ...

December 11, 2024 · Jim Robinson-Bohnslav