CM3leon is the new buzz in the AI text-to-image space, presented by Meta. Let's find out more about Meta's new AI project, CM3leon (pronounced "chameleon", like the animal). Already in the news for its new Twitter-inspired platform, Meta has now also announced its new AI image generator.
The name suggests that it can generate visuals at will (the user's will, in this case).
CM3leon is a causally masked mixed-modal (CM3) transformer; that's probably where the name comes from. It generates images and text from arbitrary sequences of images and text. Let's learn more about this new text-to-image and image-to-text generator!
CM3leon – Overview
Here are some quick facts about CM3leon: the architecture it is based on, and the amount of compute it was trained with.
|Architecture|Decoder-only transformer|
|Training compute|Less than 30% of DALL·E's|
|Training data|Text-only language models and licensed images|
FID – Fréchet Inception Distance
What is CM3leon?
It is a single foundation model launched by Meta. A new entrant into AI text-to-image territory, CM3leon stretches the boundary even further with image-to-text generation. CM3leon uses a token-based autoregressive model instead of the GAN and diffusion models used by the likes of OpenAI, Google, Microsoft, and Midjourney.
CM3leon scored 4.88 on FID (Fréchet Inception Distance) on zero-shot MS-COCO (Microsoft's large-scale object detection, segmentation, and captioning dataset). It even outperformed Google's text-to-image model Parti.
This sets a benchmark and also helps us understand CM3leon's potential to interpret complex prompts. CM3leon was trained on only 3 billion text tokens, yet stands equal to other generation models trained far more extensively.
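For context, FID is a standard metric: it compares the mean and covariance of Inception-network features extracted from real and generated images, and lower is better. Here is a minimal sketch of the computation using NumPy and SciPy (the function name and the toy self-comparison check are ours, not part of any CM3leon code):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """Compute FID between two sets of Inception feature vectors.

    feats_real, feats_gen: arrays of shape (n_samples, feat_dim),
    e.g. 2048-dim activations from an Inception-v3 network.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the product of the two covariances
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # numerical noise can add tiny imaginary parts
        covmean = covmean.real

    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Sanity check: identical feature sets should give an FID of (effectively) zero
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))
print(frechet_inception_distance(feats, feats.copy()))  # effectively zero
```

In practice the feature vectors come from a pretrained Inception-v3 model run over thousands of images; the toy random features above only illustrate the formula.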
How does CM3leon work?
CM3leon takes the help of text-only language models to understand segmentation. It works with a licensed dataset and retrieval-augmented pre-training, which doesn't just take images from the internet for learning, but text as well.
The next step is a supervised fine-tuning (SFT) stage, an approach similar to the one OpenAI took for training ChatGPT. This has helped CM3leon learn complex prompts and use them in image generation.
A causally masked mixed-modal (CM3) model creates images and text based on arbitrary sequences of the two. CM3leon is trained on a wide range of multi-task instructions, similar to a text-only generative model (like ChatGPT), so that it learns from both text and images.
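To make the causally masked idea concrete, here is a toy sketch (our simplified illustration, not Meta's actual implementation): a span of tokens is cut out, replaced with a mask sentinel, and moved to the end of the sequence, so a left-to-right decoder can still learn to infill it. The sentinel names and the helper function are hypothetical.

```python
import random

def cm3_mask(tokens, span_len=3, seed=None):
    """Toy version of a causally-masked objective: cut one span out of
    the sequence, replace it with a <mask> sentinel, and append the span
    after an <infill> marker so a left-to-right model can predict it."""
    rng = random.Random(seed)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    span = tokens[start:start + span_len]
    masked = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
    return masked + ["<infill>"] + span

# Mixed-modal toy sequence: text tokens followed by image-patch tokens
tokens = ["a", "cactus", "wearing", "a", "straw", "hat", "<img_1>", "<img_2>"]
print(cm3_mask(tokens, span_len=2, seed=0))
```

Because text and image tokens live in one sequence, the same objective covers caption-from-image, image-from-caption, and infilling in between.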
The research found that this approach takes a lot less time to generate a caption for an image, answer a visual question, edit with text, and perform conditional image generation.
Its true efficiency shows in image generation, caption generation from an image, and the ease of editing an image.
Bonus: See how we compared Midjourney and DALL·E 2 and find out who came out on top in this face-off!
With the technicals aside, what is proven on paper also translates to the images it has generated. The features below include images and prompts given to the model, as released by Meta.
1. Text-to-image generation
The images were generated with simple prompts; for example, the first image's prompt was "A small cactus wearing a straw hat and neon sunglasses in the Sahara desert".
The images look pretty clean and follow the prompt. What's notable is the fourth image, where the model generated the numbers correctly, as opposed to other image generators that fail to get signs and symbols right.
Here’s an example of the same cactus prompt we ran on the Bing Image generator.
2. Text-guided image editing
Image editing is possible with simple text: users don't have to regenerate an entire image and can replace a specific object without complex prompts and parameters.
3. Text tasks
When asked, CM3leon can understand prompts related to an image. Upon being asked to describe an image, CM3leon efficiently identifies the subject, object, and background of the image and answers questions based on it.
4. Structure-Guided Image Editing
It recognizes structural bounds and generates images within them when asked. This results in a visually coherent image that is relevant to the prompt. As in the image below, all four generations take the prompts into account and place them aptly.
The use of numbers in the above prompt is as yet unclear. Looking at it from a mathematical point of view, they look like the coordinates a user would specify to place a certain object within the frame of the image.
5. Segmentation-to-image generation
This is the segmented image: the way the AI understands the prompt. In simpler terms, the idea of a certain thing without any meaning attached to it.
6. Super-resolution results
This is the way to generate a higher-resolution, more refined picture. Again, this works better than other models because you don't need complex commands to get it.
CM3leon is claimed to be better than the image generators available in the industry, thanks to its ease of prompt understanding and its dynamic training model.
Some of the ways it can leave the likes of DALL·E 2 and Midjourney behind are:
- Accurate human anatomy representation
- Image-to-image, text-to-image capabilities
- Low training costs
- Absence of complex prompts and parameters
- Effective prompt interpretation
- Image Description
- Structure-guided image editing
Verdict – Is CM3leon Any Better?
We can't say yet whether it is any better than well-established AI text-to-image giants like Midjourney and DALL·E 2. This is still only an announcement of the potential the new model holds.
It is still unclear how fast it generates an image, what variation or upscaling options it has, or even which platform it will be available on.
Meta intends to boost creativity and find better applications for this in the Metaverse to turn investors in its favor (maybe). There has been no mention of CM3leon being a commercial product competing with the others.
Only the future will tell what Meta plans to do with this new entrant. Will it challenge Google, OpenAI, and Microsoft by combining this with an LLM, or use it for Metaverse curation? Who knows.
FAQs
What is CM3leon?
It is a single foundation AI model capable of text-to-image and image-to-text generation. It is based on a training approach that learns from a licensed dataset and from text-based models.
How is CM3leon pronounced?
It is pronounced "chameleon" (kə-mēl′yən, -mē′lē-ən), like the animal.
Is CM3leon available to use?
Currently, CM3leon is only at the announcement stage, with just the model having been revealed. Whether that will translate into a service or be integrated for a larger purpose is not yet known.