Playing with Stability AI’s Image Generation and Editing

Nakamura Hiroki
9 min read · Jun 3, 2024

--

It’s been about two months since I joined Stability AI. I feel that Stability AI itself is still not as well known as Stable Diffusion. Even when people do know the company, they likely picture it mainly as one that creates generative AI models like Stable Diffusion.

While it’s true that developing generative AI models is a significant strength, we have recently been increasing our efforts to make these technologies easily accessible.

For example, on the Developer Platform, we are frequently adding new APIs for image generation, image editing, and video generation. In addition to image generation APIs like Stable Diffusion 3, there are also APIs for replacing objects within images (Search & Replace), creating new images while maintaining the structure of the original image, such as changing the style (Control Structure), generating images based on sketches (Control Sketch), and of course, upscaling images.
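As a rough illustration of what a call to one of these APIs looks like, here is a minimal Python sketch for the Stable Diffusion 3 generation endpoint. The endpoint path, header names, and form fields reflect the v2beta Stable Image API as I understand it; treat them as assumptions and check the Developer Platform docs for the authoritative shapes.

```python
# Minimal sketch of a Stable Diffusion 3 generation call.
# The endpoint path and form fields are assumptions based on the v2beta
# Stable Image API -- verify against the Developer Platform docs.

API_HOST = "https://api.stability.ai"

def build_sd3_request(prompt: str, api_key: str) -> dict:
    """Assemble URL, headers, and multipart form fields for an SD3 call."""
    return {
        "url": f"{API_HOST}/v2beta/stable-image/generate/sd3",
        "headers": {
            "authorization": f"Bearer {api_key}",
            "accept": "image/*",  # ask for the raw image bytes back
        },
        "files": {"none": ""},  # forces multipart/form-data encoding
        "data": {"prompt": prompt, "output_format": "png"},
    }

def generate(prompt: str, api_key: str, out_path: str = "sd3.png") -> None:
    import requests  # third-party; pip install requests
    req = build_sd3_request(prompt, api_key)
    resp = requests.post(req["url"], headers=req["headers"],
                         files=req["files"], data=req["data"])
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)
```

The edit-style endpoints (Search & Replace, Control Structure, Control Sketch) follow the same pattern, with the source image attached as an extra multipart file field.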

These APIs often combine multiple models to achieve their functionality, providing not only the underlying models but also features tailored to actual use cases.
Furthermore, we are starting to focus on developing products with user interfaces, such as Stable Artisan, which can be used on Discord, and Stable Assistant, which can be used in a chat format on a web page.

Technology development, product development, and business are very close together, so I believe we will be able to provide even more practical technologies through easy-to-use products in the future. Stay tuned!

That’s all for the brief introduction!

Now that I feel I’ve fulfilled my minimum responsibilities as an employee, I’d like to spend the rest of the time messing around with the publicly available tools. Everything below uses Stable Assistant.

Daruma

First, I was thinking about a topic, and then I found this while looking through my old photos.

As you can see, it’s a daruma.

It’s familiar to Japanese people, but for those who aren’t familiar with it, I asked Stable Assistant what a daruma is. (It’s using Stable LM 7B.)

It seems to be somewhat accurate (I don’t actually know much about it)

This daruma was given to me by my boss from two jobs ago. He’s not a native Japanese person, but he’s more Japanese than actual Japanese people. It is customary to draw in one eye when starting something and to draw in the other eye when it’s completed. The fact that both eyes are drawn means it probably went well. I forget what it was for, but that’s fine.

Anyway, I want to play with this daruma.

First, let’s try outpainting.
Outpainting is a feature that generates the outside of an image. It can create a world that doesn’t exist.
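For readers who would rather try this through the API than through Stable Assistant, outpainting maps naturally to a request that sends the source image plus how many pixels of new canvas to generate on each side. A hedged sketch (the `/v2beta/stable-image/edit/outpaint` path and the `left`/`right`/`up`/`down` field names are my assumptions from the v2beta API; confirm in the docs):

```python
# Build the form fields for an outpaint request: how many pixels of new
# canvas to generate on each side of the source image.
# Field names (left/right/up/down) are assumptions from the v2beta
# Stable Image API -- check the official docs before relying on them.

def build_outpaint_fields(left: int = 0, right: int = 0,
                          up: int = 0, down: int = 0,
                          prompt: str = "") -> dict:
    """Return the non-file form fields, dropping zero-pixel directions."""
    fields = {"left": left, "right": right, "up": up, "down": down}
    data = {k: v for k, v in fields.items() if v > 0}
    if prompt:
        data["prompt"] = prompt  # optional hint about what to imagine
    data["output_format"] = "png"
    return data

# Example: grow a tightly cropped photo by 256 px on every side.
fields = build_outpaint_fields(left=256, right=256, up=256, down=256)
# The actual call would then be roughly:
#   requests.post(f"{API_HOST}/v2beta/stable-image/edit/outpaint",
#                 headers={"authorization": f"Bearer {key}", "accept": "image/*"},
#                 files={"image": open("daruma_crop.png", "rb")}, data=fields)
```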

The original image shows the whole body(?), which is not very interesting(?), so I deliberately narrowed down the world.

Like this

(Input image)

From here, can AI imagine the daruma?
Let’s try outpainting the outside. I’m expecting an encounter with an unknown creature.
.
.
.
.
.

Generated image (Outpaint)

!!??
It’s a daruma.
It moved from the table to a green sheet, but it’s still a daruma. It’s a bit egg-like, but it’s unreasonable to say it’s not a daruma.
It defied my expectations in a wonderful way.

Maybe it was a bit too easy.
Let’s crop it more.

Input image

The eyes that I painted over are very rough, so I can still tell it’s a daruma, but it’s quite challenging.
Let’s try outpainting this image again
.
.
.
.
.

Generated image (Outpaint)

I did it!
It’s not a living thing. This is definitely takoyaki.
The octopus is popping out quite a bit.

I’m glad that the world’s AI understands the soul food of Japan’s Kansai region.

By the way, I asked what takoyaki is.

It’s a fairly orthodox explanation. It says it’s delicious and worth trying, so I’d say it’s correct. (By the way, real takoyaki doesn’t have the octopus popping out raw like this. The appearance, smell, and taste are all very good. Just in case.)

Anyway, I’m getting carried away and cropping it even more.

Input image

I have no idea what it is anymore.
So let’s try outpainting.
.
.
.
.

Generated image (Outpaint)

!!??
It’s the ocean.
The daruma (or part of it) is riding the waves. It seems to be getting tossed around by quite rough waves. Looks tough.

I’ve reached the limits of my own interpretation, so I’ll ask how I should take this.

Answer

I see, things can be interpreted in a good way. It’s a model answer.
I want to learn from it too.

I feel bad for cropping it so much, so I’ll try to be nice and restore it.
Let’s try using the Enhance feature. This is part of the upscaling feature, but it not only increases the resolution but also allows you to specify the style with a prompt.
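In API terms, I believe Enhance corresponds to the creative upscale endpoint, which (to my knowledge) is asynchronous: the POST returns a generation id, and you poll a results URL until the image is ready. A sketch under those assumptions — every path and field name here should be double-checked against the docs:

```python
# Sketch of an Enhance-style call via what I believe is the creative
# upscale endpoint. In the v2beta API this endpoint is (as far as I know)
# asynchronous: the POST returns a generation id, and you poll a results
# URL until the image is ready. All paths and fields are assumptions.
import time

API_HOST = "https://api.stability.ai"

def upscale_urls(generation_id: str) -> tuple:
    """Return the (submit, poll) URLs for a creative upscale job."""
    submit = f"{API_HOST}/v2beta/stable-image/upscale/creative"
    poll = f"{API_HOST}/v2beta/results/{generation_id}"
    return submit, poll

def enhance(image_path: str, prompt: str, api_key: str) -> bytes:
    import requests  # third-party; pip install requests
    submit, _ = upscale_urls("")
    headers = {"authorization": f"Bearer {api_key}", "accept": "image/*"}
    resp = requests.post(submit, headers=headers,
                         files={"image": open(image_path, "rb")},
                         data={"prompt": prompt, "output_format": "png"})
    resp.raise_for_status()
    gen_id = resp.json()["id"]  # job id for polling (assumed shape)
    _, poll = upscale_urls(gen_id)
    while True:
        r = requests.get(poll, headers=headers)
        if r.status_code != 202:  # 202 = still generating (assumed)
            r.raise_for_status()
            return r.content
        time.sleep(5)
```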

Since it’s round (in shape…), let’s try making it sharp. The input image is the very first daruma image.
.
.
.
.
.
.

Generated image (Enhance)

It’s been digitized.
It’s not the daruma I know. No, it’s not a daruma in the first place. Probably. It’s shiny. And the eyes are definitely cameras. Therefore, I can’t paint over them anymore.

I’m getting bored, so let’s move on to the next thing.

Riddles

Suddenly, I tried making riddles.

Just text would be boring, so I tried making ones where you guess the answer from text and images.
For the images, I thought of content that I felt was good and had Stable Assistant create them.

Question:
I start with ‘T’, end with ‘T’, and I have ‘T’ in me. What am I?

Generated image

There’s something suspiciously steaming in the center of the table. I feel like it’s kind of giving away the answer.

The answer is

.
.
.
.
.

Teapot!

It’s not very interesting.

Without losing heart, I’ll try making one more

Question:
I am full of keys but can’t open any doors. What am I?

Generated image

If it’s this obvious, it might be confusing instead. I don’t think it will be though.

The answer is
.
.
.
.
.

Keyboard!

The second one was also not great.
Rather than the images themselves, I suspect the whole premise of making riddles out of text and images was beyond my imagination.

Regardless of the result, I want to believe it was a nice try.

Memes

Next, I want to try making meme images.
With the very simplistic idea that cats are synonymous with memes, I tried making memes featuring cats. (I’m persistently using Stable Assistant)

Generated image

A skeptical-looking cat with raised eyebrows and a cup of coffee, with the caption ‘You expect me to work before my morning coffee?’

Nice expression. If a colleague messages me around 6 am with something non-urgent, it might be good to reply with this meme image. I can’t take responsibility though.

Just in case, I also tried making a meme image to reply to the above meme image.

Generated image

A digital art of a sarcastic-looking dog with a smirk, holding a mug of tea. The dog has an expression that conveys ‘Oh, you need coffee to function? That’s cute.’

If peaceful communication can be established by replying with this meme image, I think it’s a great relationship. If you misjudge the relationship or the mood of the moment even a little, though, it could turn into a major incident.

As something a bit more versatile, I tried making an image expressing the Monday blues that would be common among people around the world who work Monday to Friday.

Generated image

a cat with a sad expression, holding a ‘Back to Monday’ sign, with a caption ‘Sunday evening blues’

I feel like meme images of this level can be used a little more casually. The sleepy face creates a good atmosphere. Regardless of whether it will have a good effect if sent to colleagues on Friday evening, I think the message will get across.

Profile

I’m getting tired, but lastly I want to try the most typical use case: editing a profile picture.

The original image is my own profile picture on this blog.

Using Stable Assistant’s “New Image with Structure”, as the name suggests, you can maintain the structure of the original image and change the style.
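Under the hood, this kind of structure-preserving style change presumably corresponds to the Control Structure API mentioned earlier, where a strength parameter sets how tightly the output follows the original layout. A hedged sketch of just the form fields (the `control_strength` name and its [0, 1] range are my assumptions from the v2beta API; verify in the docs):

```python
# Form fields for a Control Structure request: a prompt describing the
# new style, plus a control_strength in [0, 1] for how tightly the
# output should follow the original image's layout.
# Field names are assumptions from the v2beta API -- check the docs.

def build_structure_fields(prompt: str, control_strength: float = 0.7) -> dict:
    if not 0.0 <= control_strength <= 1.0:
        raise ValueError("control_strength must be in [0, 1]")
    return {
        "prompt": prompt,
        "control_strength": control_strength,
        "output_format": "png",
    }

# e.g. an anime-style restyle that sticks closely to the original:
anime = build_structure_fields("anime style portrait", control_strength=0.8)
```

The source image itself would travel as a multipart file field alongside these values, as with the other edit endpoints.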

Anime Style

Generated image (New Image with Structure)

Japanese Ink Painting

Generated image (New Image with Structure)

It looks like a new me has been born while retaining the composition and atmosphere of the original image. Still, the composition doesn’t shift at all. Unfortunately, the sloping shoulders remain the same. It’s true to life, so it can’t be helped.

Finally, I tried making my profile picture sharp with “Enhance”

.
.
.
.
.

Generated image (Enhance)

Who the heck is that?!

That’s all.

At the End

This time, I used Stable Assistant to create images and content within the text (Q&A, etc.) (Sorry, but only English is supported at the moment). Of course, you can also create original applications using the APIs on the Developer Platform.

Also, to emphasize that it can be used casually (?), I made the content quite silly, but of course, it can also create beautiful images. If you look at the videos on the landing page and the Gallery page, you can easily see what kind of content can be created and what features are available. There’s also a free trial, so please try it out if you have time!

While it’s natural to pursue basic performance such as improving the quality of content and increasing controllability, I also want to pursue making it easier for anyone to use. Technologies related to Stable Diffusion are rapidly evolving even within the OSS community, so there is inevitably a time gap between the latest technologies and those that are easy to use. In the future, I want to try to close that gap as much as possible.

Also, in terms of how it can be used, I think there is still a lot of room for expanding its usage, such as using it to broaden thinking and as an aid to express what you want to convey. Furthermore, its potential expands not only as a standalone tool but also by being incorporated into services. In the character AI products I worked on before, text was mainstream. However, by making image and video generation easily accessible, it can expand the means of self-expression and communication for characters.

If anyone has ideas like this or is using it in this way, I’d be happy if you could reach out to me on LinkedIn or other platforms. I want to appropriately give meaning to the technology and make it usable by more people.

(The cover image was a cat doing neural network research)

Generated image
