I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
GEMINI_API_KEY="..." \
uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
python -m gemimg "a racoon holding a hand written sign that says I love trash"
The author went to great lengths about open source early on. I wonder if they'll cover the QwenEdit ecosystem.
I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
I've been keeping an eye on Qwen-Edit/Wan 2.2 shenanigans and they are interesting: however actually running those types of models is too cumbersome and in the end unclear if it's actually worth it over the $0.04/image for Nano Banana.
The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!
This looks like it's caused by 99% of the relative directions in image descriptions describing them from the looker's point of view, and that 99% of the ones that aren't it they refer to a human and not to a skull-shaped pancake.
I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later.
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt.
To be honest this is the sort of thing Nano Bannana is weak at in my experience. It's absolutely amazing - but doesn't understand left/right/up/down/shrink this/move this/rotate this etc.
See below to demonstrate this weakness with the same prompts as the article see the link below, which demonstrates that it is a model weakness and not just a language ambiguity:
I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."
Most designers can't, either. Defining a spec is a skill.
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
Case in point, the final image in this post (the IP bonanza) took 28 iterations of the prompt text to get something maximally interesting, and why that one is very particular about the constraints it invokes, such as specifying "distinct" characters and specifying they are present from "left to right" because the model kept exploiting that ambiguity.
Hey! The author, thank you for this post! QQ, any idea roughly how much this experimentation cost you? I'm having trouble processing their image generation pricing I may just not be finding the right table. I'm just trying to understand if I do like 50 iterations at the quality in the post, how much is that going to cost me?
Use Google AI Studio to submit requests, and to remove watermark, open browser development tools and right click on request to “watermark_4” image and select to block it. And from next generation there will be no watermark!
For images of people generated from scratch, Nano Banana always adds a background blur, it can't seem to create more realistic or candid images such as those taken via a point and shoot or smartphone, has anyone solved this sort of issue? It seems to work alright if you give it an existing image to edit however. I saw some other threads online about it but I didn't see anyone come up with solutions.
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
In my own experience, nano banana still has the tendency to:
- make massive, seemingly random edits to images
- adjust image scale
- make very small case but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
The "ALL CAPS" part of your comment got me thinking. I imagine most llms understand subtle meanings of upper case text use depending on context. But, as I understand it, ALL CAPS text will tokenize differently than lower case text. Is that right? In that case, won't the upper case be harder to understand and follow for most models since it's less common in datasets?
There's more than enough ALL CAPS text in the corpus of the entire internet, and enough semantic context associated with it for it to be intended to be in the imperative voice.
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
Theres lots these models can do but I despise when people suggest they can do edits with "with only the necessary aspects changed".
No, that simply is not true. If you actually compare the before and after you can see it still regenerates all the details on the "unchanged" aspects. Texture, lighting, sharpness, even scale its all different even if varyingly similar to the original.
Sure they're cute for casual edits but it really pains me people suggesting these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data theres a lot of nuance that can be lost as it regenerates them no matter how you prompt things.
Nano Banana is different and much better at edits without changing texture/lighting/sharpness/color balance, and I am someone that is extremely picky about it. That's why I add the note that Gemini 2.5 Flash is aware of segmentation masks, and that's my hunch why that's the case.
> It’s one of the best results I’ve seen for this particular test, and it’s one that doesn’t have obvious signs of “AI slop” aside from the ridiculous premise.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
The kicker for nano banana is not prompt adherence which is a really nice to have but the fact that it's either working on pixel space or with a really low spatial scaling. It's the only model that doesn't kill your details because of vae encode/decode.
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
Not really since "prompt engineering" can be tossed in the same pile as "vibe coding." Just people coping with not developing the actual skills to produce the desired products.
Try getting a small model to do what you want quickly with high accuracy, high quality, etc, and using few tokens per request. You'll find out that prompt engineering is real and matters.
Gemini is instructed to reply in images first, and if it thinks, to think using the image thinking tags. It cannot seemingly be prompted to show verbatim the result 4+5 without showing the answer 4+5=9. Of course it can show whatever exact text that you want, the question is, does it prompt rewrite (no) or do something else (yes)?
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.
I like the Python library that accompanies this: https://github.com/minimaxir/gemimg
I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...The author went to great lengths about open source early on. I wonder if they'll cover the QwenEdit ecosystem.
I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
I've been keeping an eye on Qwen-Edit/Wan 2.2 shenanigans and they are interesting: however actually running those types of models is too cumbersome and in the end unclear if it's actually worth it over the $0.04/image for Nano Banana.
The author overlooked an interesting error in the second skull pancake image: the strawberry is on the right eye socket (to the left of the image), and the blackberry is on the left eye socket (to the right of the image)!
This looks like it's caused by 99% of the relative directions in image descriptions describing them from the looker's point of view, and that 99% of the ones that aren't it they refer to a human and not to a skull-shaped pancake.
I admit I missed this, which is particularly embarrassing because I point out this exact problem with the character JSON later.
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt.
I picked up on that also. I feel that a lot of humans would also get confused about whether you mean the eye on the left, or the subject's left eye.
To be honest this is the sort of thing Nano Bannana is weak at in my experience. It's absolutely amazing - but doesn't understand left/right/up/down/shrink this/move this/rotate this etc.
See below to demonstrate this weakness with the same prompts as the article see the link below, which demonstrates that it is a model weakness and not just a language ambiguity:
https://gemini.google.com/share/a024d11786fc
I am a human, and I would have done the same thing as Nano Banana. If the user had wanted a strawberry in the skull's left eye, they should've said, "Put a strawberry in its left eye socket."
"prompt engineered"...i.e. by typing in what you want to see.
Not all models can actually do that if your prompt is particular
Most designers can't, either. Defining a spec is a skill.
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
Yep, knowing how and what to ask is a skill.
For anything, even back in the "classical" search days.
We understand now that we interface with LLMs using natural and unnatural language as the user interface.
This is a very different fuzzy interface compared to programming languages.
There will be techniques better or worse at interfacing.
This is what the term prompt engineering is alluding to since we don’t have the full suite of language to describe this yet.
... and then iterating on that prompt many times, based on your accumulated knowledge of how best to prompt that particular model.
Case in point, the final image in this post (the IP bonanza) took 28 iterations of the prompt text to get something maximally interesting, and why that one is very particular about the constraints it invokes, such as specifying "distinct" characters and specifying they are present from "left to right" because the model kept exploiting that ambiguity.
Hey! The author, thank you for this post! QQ, any idea roughly how much this experimentation cost you? I'm having trouble processing their image generation pricing I may just not be finding the right table. I'm just trying to understand if I do like 50 iterations at the quality in the post, how much is that going to cost me?
"amenable to highly specific and granular instruction"
Use Google AI Studio to submit requests, and to remove watermark, open browser development tools and right click on request to “watermark_4” image and select to block it. And from next generation there will be no watermark!
For images of people generated from scratch, Nano Banana always adds a background blur, it can't seem to create more realistic or candid images such as those taken via a point and shoot or smartphone, has anyone solved this sort of issue? It seems to work alright if you give it an existing image to edit however. I saw some other threads online about it but I didn't see anyone come up with solutions.
I was kind of surprised by this line:
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
You can't take a highly artistic image and supply it as a style reference. Nano Banana can't generalize to anything not in its training.
Well, I just asked it for a 13-sided irregular polygon (is it that hard?)…
https://imgur.com/a/llN7V0W
The blueberry and strawberry are not actually where they prompted.
In my own experience, nano banana still has the tendency to:
- make massive, seemingly random edits to images - adjust image scale - make very small case but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
Has anyone had a better experience?
The "ALL CAPS" part of your comment got me thinking. I imagine most llms understand subtle meanings of upper case text use depending on context. But, as I understand it, ALL CAPS text will tokenize differently than lower case text. Is that right? In that case, won't the upper case be harder to understand and follow for most models since it's less common in datasets?
There's more than enough ALL CAPS text in the corpus of the entire internet, and enough semantic context associated with it for it to be intended to be in the imperative voice.
It's really cool how good of a job it did rendering a page given its HTML code. I was not expecting it to do nearly as well.
> Nano Banana is still bad at rendering text perfectly/without typos as most image generation models.
I figured that if you write the text in Google docs and share the screenshot with banana it will not make any spelling mistake.
So, use something like "can you write my name on this Wimbledon trophy, both images are attached. Use them" will work.
Google's example documentation for Nano Banana does demo that pipeline: https://ai.google.dev/gemini-api/docs/image-generation#pytho...
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
Theres lots these models can do but I despise when people suggest they can do edits with "with only the necessary aspects changed".
No, that simply is not true. If you actually compare the before and after you can see it still regenerates all the details on the "unchanged" aspects. Texture, lighting, sharpness, even scale its all different even if varyingly similar to the original.
Sure they're cute for casual edits but it really pains me people suggesting these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data theres a lot of nuance that can be lost as it regenerates them no matter how you prompt things.
Even if you
Nano Banana is different and much better at edits without changing texture/lighting/sharpness/color balance, and I am someone that is extremely picky about it. That's why I add the note that Gemini 2.5 Flash is aware of segmentation masks, and that's my hunch why that's the case.
That is true for gpt-image-1 but not nano-banana. They can do masked image changes
Nano banana has a really low spatial scaling and doesn't affect details like other models.
> It’s one of the best results I’ve seen for this particular test, and it’s one that doesn’t have obvious signs of “AI slop” aside from the ridiculous premise.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
The kicker for nano banana is not prompt adherence which is a really nice to have but the fact that it's either working on pixel space or with a really low spatial scaling. It's the only model that doesn't kill your details because of vae encode/decode.
I'm getting annoyed by using "prompt engineered" as a verb. Does this mean I'm finally old and bitter?
(Do we say we software engineered something?)
I think it’s meant to be engineering in the same sense as “social engineering”.
You're definitely old and bitter, welcome to it.
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
Don't get me wrong, I have nothing against using AI as an expression of creativity :)
Create? So I have created all that code I'm running on my site, yes is bad I know, but thank you very much! Such creative guy I was!
Not really since "prompt engineering" can be tossed in the same pile as "vibe coding." Just people coping with not developing the actual skills to produce the desired products.
Try getting a small model to do what you want quickly with high accuracy, high quality, etc, and using few tokens per request. You'll find out that prompt engineering is real and matters.
Couldn't care less. I don't need to know how to do literally everything. AI fills in my gaps and I'm a ton more productive.
I wouldn't bother trying to convince people who are upset that others have figured out a way to use LLMs. It's not logical.
lots of words
okay, look at imagen 4 ultra:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
In this link, Imagen is instructed to render the verbatim prompt “the result of 4+5”, which shows that text, and not instructed, which renders “4+5=9”
Is Imagen thinking?
Let's compare to gemini 2.5 flash image (nano banana):
look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini is instructed to reply in images first, and if it thinks, to think using the image thinking tags. It cannot seemingly be prompted to show verbatim the result 4+5 without showing the answer 4+5=9. Of course it can show whatever exact text that you want, the question is, does it prompt rewrite (no) or do something else (yes)?
compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0
without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.
Can you provide screenshots or links that don't require login
sorry, but I don't understand you post. those links don't work.