Thursday, June 13th 2024

Stability AI Outs Stable Diffusion 3 Medium, Company's Most Advanced Image Generation Model

Stability AI, a maker of various generative AI models and the company behind the Stable Diffusion text-to-image models, has released its latest model, Stable Diffusion 3 (SD3) Medium. With two billion dense parameters, SD3 Medium is the company's most advanced text-to-image model to date. It generates highly realistic and detailed images across a wide range of styles and compositions, and it handles intricate prompts that involve spatial reasoning, actions, and diverse artistic directions. The model's architecture, including a 16-channel variational autoencoder (VAE), allows it to overcome common challenges faced by other models, such as accurately rendering realistic human faces and hands.

Additionally, it achieves exceptional text quality, with precise letter formation, kerning, and spacing, thanks to the Diffusion Transformer architecture. Notably, the model is resource-efficient: its low VRAM footprint lets it run smoothly on consumer-grade GPUs without compromising performance. Furthermore, it exhibits impressive fine-tuning abilities, absorbing and replicating nuanced details from small datasets, which makes it highly customizable for specific use cases. As an open-weight model, it is available for download on Hugging Face, and it has libraries optimized for both NVIDIA's TensorRT (all modern NVIDIA GPUs) and AMD Radeon/Instinct GPUs.
Source: Stability AI

12 Comments on Stability AI Outs Stable Diffusion 3 Medium, Company's Most Advanced Image Generation Model

#1
Firedrops
Y'all should go see what real outputs (twitter, reddit) are like from this model, it's so bad it's funny.
#2
Chrispy_
woohoo, a focus on VRAM footprint.

Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI and VRAM optimisation is welcome.
#3
evernessince
Probably the worst Stable Diffusion release to date. It's god awful at doing anatomy, particularly females. Worse than SD1.5 bad. The licensing changes are killing the fine-tuning community to boot.

It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either.
Chrispy_ wrote: "woohoo, a focus on VRAM footprint. Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI and VRAM optimisation is welcome."
They mean with regard to the full 8B model. SD3 2B (this release) uses 16.3GB of VRAM without any LoRA or IP-Adapter, which means you are going to want at least a 4080 for best performance. The vast majority of Nvidia users, including those of their upcoming cards if rumors are correct, will be SOL VRAM-wise.
#4
mashie
Can it draw human hands with say all 5 fingers in the right locations?
#5
evernessince
mashie wrote: "Can it draw human hands with say all 5 fingers in the right locations?"
Nope, it has big issues with that. Both SD1.5 and SDXL are better in that regard (and both of those are far from perfect at hands as well).
#6
remixedcat
I use SD thru a mobile app and it won't render anime catgirls wearing leather pants but it allows rendering of very nasty stuff that is too hot for this site (in a bad way)

It used to, but this update made it unable to.
#7
JWNoctis
evernessince wrote: "Probably the worst Stable Diffusion release to date. It's god awful at doing anatomy, particularly females. Worse than SD1.5 bad. The licensing changes are killing the fine-tuning community to boot.

It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either.

They mean with regard to the full 8B model. SD3 2B (this release) uses 16.3GB of VRAM without any LoRA or IP-Adapter, which means you are going to want at least a 4080 for best performance. The vast majority of Nvidia users, including those of their upcoming cards if rumors are correct, will be SOL VRAM-wise."
To be fair, the model does show composition and spelling capabilities previously unseen in any open-weight model. That rather overwrought safety engineering could theoretically be moderated or entirely reversed, even given the licensing restrictions, though that leads to other, far more complicated and much worse issues in a model capable of advanced composition, beyond the obviously worsened harm to artists' communities.

Licensing restrictions are not unlike choking hazard warnings on a pack of jelly beans; they won't actually stop people from either choking on them by accident or, for that matter, stuffing them up their noses, even though either is probably less legally actionable. :p

Point being, the Pandora's Box is open as soon as the model weights are released, and it's not a good look.

There are also ways to reduce VRAM usage, often enabled by default on various local deployment platforms. The encoders could be unloaded, and sub-INT8 quantization could potentially reduce the model weights themselves to a reasonable size, if they decide to release the 8B version as well.
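
To put rough numbers on the quantization point, here is a back-of-the-envelope sketch in Python. It counts only the diffusion model's own weights (2B for this release, 8B for the larger one), and the storage formats listed are illustrative assumptions; text encoders, the VAE, and activations are excluded, so real VRAM usage would be higher.

```python
# Bits per parameter for a few storage formats (illustrative assumptions;
# "int4" stands in for the sub-INT8 quantization mentioned above).
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_gib(num_params: float, bits: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return num_params * bits / 8 / 2**30


# Parameter counts taken from the thread: 2B (this release) and 8B.
MODELS = {"SD3 Medium (2B)": 2e9, "SD3 8B": 8e9}

for name, params in MODELS.items():
    sizes = ", ".join(
        f"{fmt}: {weight_gib(params, bits):.1f} GiB"
        for fmt, bits in BITS_PER_PARAM.items()
    )
    print(f"{name}: {sizes}")
```

Under these assumptions, the 8B weights at 4 bits per parameter would occupy the same space as the 2B weights at 16 bits, which is the sense in which quantization could bring the larger model down to a reasonable size.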
#8
Minus Infinity
Ok, please explain how, after training with exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a 6-legged dog. Just watched a video of the latest Stable Diffusion AI-generated video and nearly all humans had glaring issues with arms/hands/legs depending on the pose.
#9
Wirko
Minus Infinity wrote: "Ok, please explain how, after training with exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a 6-legged dog."
How could SD possibly know that these aren't average humans and animals?
Minus Infinity wrote: "Just watched a video of the latest Stable Diffusion AI-generated video and nearly all humans had glaring issues with arms/hands/legs depending on the pose."
Wow. Is the number of fingers at least constant through the videos?
#10
JWNoctis
Wirko wrote: "How could SD possibly know that these aren't average humans and animals?"
"photography, many-faced many-armed statue of the Hindu pantheon", "majolica, triskelion, winged angelic head", and "sable a six-legged canine regardant, breathing flame gules, rotated 7 degrees counterclockwise from horizontal" would probably be actually quite distinctive in the dataset, if present at all. It's all in the labels.

Now I'm not an expert at this, but IMHO what happened here is likely some or all of these things, and certainly something else too:

First, they filtered their dataset based on labels, and excluded even relatively tame keywords of remotely possible misuse like "lying", among the more obviously questionable ones. This also filtered out the majority of the data associated with difficult but usually safe anatomical details, like hands.

Furthermore, they (EDIT: could have) included a poisoned dataset with labels associated with those concepts, made up of intentionally scrambled images, perhaps to impede finetuning to undo those limits. Though I wonder whether they actually went that far.

Lastly, perhaps they did not train on a large dataset to begin with, due to copyright, quality, and/or budgetary concerns, and overfitted.
#11
InVasMani
Minus Infinity wrote: "Ok, please explain how, after training with exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a 6-legged dog. Just watched a video of the latest Stable Diffusion AI-generated video and nearly all humans had glaring issues with arms/hands/legs depending on the pose."
I'd say much of it has to do with it being trash. No, it's probably attributable in no small part to negative prompt modifiers like bad anatomy, disfigured, blurred, butt on face, etc.

If you want to train it right with consistency, I think you'd need to start with hand, fist, grip, and claw, get consistency out of each, and then use the appropriate one for whatever your intended scene is. Additionally, left and right on those. I'm sure a LoRA on each of those would help. You can still train a collection without a LoRA; it's just more difficult and offers less tuning, since you can't weight or vary the results when you need or want to blend and fuse things together the way you intend.

In the end, "draw me a perfect..." etc. isn't really going to work at this point, especially when talking about something unconventional that doesn't even have a dataset in the first place, like animals that are human-like and also hybrid in various ways. DALL-E 3 does a pretty decent job overall though, and you can usually work with it pretty well to get something close to what you were intending out of it. You know, like groundhogs with sledgehammers: they be sledging hard at work in the underground, they got the Mjölnir might, they ain't no sucka whack-a-moles...
#12
Minus Infinity
Wirko wrote: "How could SD possibly know that these aren't average humans and animals?

Wow. Is the number of fingers at least constant through the videos?"
Not sure I paid close enough attention, but I don't think so. As they shifted their arms, new arms appeared or disappeared and when their hands got close together the fingers fused!