Stability AI Outs Stable Diffusion 3 Medium, Company's Most Advanced Image Generation Model

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,582 (0.97/day)
Stability AI, a maker of various generative AI models and the company behind text-to-image Stable Diffusion models, has released its latest Stable Diffusion 3 (SD3) Medium AI model. Running on two billion dense parameters, the SD3 Medium is the company's most advanced text-to-image model to date. It boasts features like generating highly realistic and detailed images across a wide range of styles and compositions. It demonstrates capabilities in handling intricate prompts that involve spatial reasoning, actions, and diverse artistic directions. The model's innovative architecture, including the 16-channel variational autoencoder (VAE), allows it to overcome common challenges faced by other models, such as accurately rendering realistic human faces and hands.

Additionally, it achieves exceptional text quality, with precise letter formation, kerning, and spacing, thanks to the Diffusion Transformer architecture. Notably, the model is resource-efficient, capable of running smoothly on consumer-grade GPUs without compromising performance due to its low VRAM footprint. Furthermore, it exhibits impressive fine-tuning abilities, allowing it to absorb and replicate nuanced details from small datasets, making it highly customizable for specific use cases that users may have. Being an open-weight model, it is available for download on HuggingFace, and it has libraries optimized for both NVIDIA's TensorRT (all modern NVIDIA GPUs) and AMD Radeon/Instinct GPUs.



 
Joined
Feb 20, 2019
Messages
8,278 (3.93/day)
System Name Bragging Rights
Processor Atom Z3735F 1.33GHz
Motherboard It has no markings but it's green
Cooling No, it's a 2.2W processor
Memory 2GB DDR3L-1333
Video Card(s) Gen7 Intel HD (4EU @ 311MHz)
Storage 32GB eMMC and 128GB Sandisk Extreme U3
Display(s) 10" IPS 1280x800 60Hz
Case Veddha T2
Audio Device(s) Apparently, yes
Power Supply Samsung 18W 5V fast-charger
Mouse MX Anywhere 2
Keyboard Logitech MX Keys (not Cherry MX at all)
VR HMD Samsung Oddyssey, not that I'd plug it into this though....
Software W10 21H1, barely
Benchmark Scores I once clocked a Celeron-300A to 564MHz on an Abit BE6 and it scored over 9000.
woohoo, a focus on VRAM footprint.

Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI and VRAM optimisation is welcome.
 
Joined
Jul 13, 2016
Messages
3,279 (1.07/day)
Processor Ryzen 7800X3D
Motherboard ASRock X670E Taichi
Cooling Noctua NH-D15 Chromax
Memory 32GB DDR5 6000 CL30
Video Card(s) MSI RTX 4090 Trio
Storage Too much
Display(s) Acer Predator XB3 27" 240 Hz
Case Thermaltake Core X9
Audio Device(s) Topping DX5, DCA Aeon II
Power Supply Seasonic Prime Titanium 850w
Mouse G305
Keyboard Wooting HE60
VR HMD Valve Index
Software Win 10
Probably the worst Stable Diffusion release to date. It's god-awful at doing anatomy, particularly females. Worse than SD1.5 bad. The licensing changes are killing the fine-tuning community to boot.

It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either.

woohoo, a focus on VRAM footprint.

Nvidia's paltry VRAM allocations are a big hurdle for creators just wanting to test out AI and VRAM optimisation is welcome.

They mean with regard to the full 8B model. SD3 2B (this release) uses 16.3GB of VRAM without any LoRA or IP-Adapter, which means you are going to want at least a 4080 for best performance. The vast majority of Nvidia users, including those of their upcoming cards if rumors are correct, will be SOL VRAM-wise.
 
Joined
Oct 15, 2004
Messages
189 (0.03/day)
Location
Peterborough, UK
System Name IONE
Processor AMD Ryzen 9 5900X
Motherboard ASUS STRIX B550-A Gaming
Cooling Noctua NH-U12S SE-AM4
Memory 128GB (4x32GB) Corsair DDR4 Vengeance LPX Black, PC4-25600 (3200), CMK128GX4M4E3200C16
Video Card(s) PNY GeForce RTX 3080 12GB
Storage Samsung 980 1TB NVMe (system), Lexar NM790 4TB NVMe (temp), 16x Seagate IronWolf 10TB RAID6
Display(s) Dell UP3017
Case Lian-Li PC-777B
Audio Device(s) Focal Alpha 65 Evo
Power Supply Corsair AX1200
Mouse Logitech M510
Keyboard Keychron Q10, brass plate, Kailh Box Summer switches and PBT Cherry keycaps
Software Xubuntu 22.04
Benchmark Scores N/A
Can it draw human hands with say all 5 fingers in the right locations?
 
Joined
Jul 13, 2016
Messages
3,279 (1.07/day)
Processor Ryzen 7800X3D
Motherboard ASRock X670E Taichi
Cooling Noctua NH-D15 Chromax
Memory 32GB DDR5 6000 CL30
Video Card(s) MSI RTX 4090 Trio
Storage Too much
Display(s) Acer Predator XB3 27" 240 Hz
Case Thermaltake Core X9
Audio Device(s) Topping DX5, DCA Aeon II
Power Supply Seasonic Prime Titanium 850w
Mouse G305
Keyboard Wooting HE60
VR HMD Valve Index
Software Win 10
Can it draw human hands with say all 5 fingers in the right locations?

Nope, it has big issues with that. Both SD1.5 and SDXL are better in that regard (and both of those are far from perfect at hands as well).
 
Joined
May 13, 2010
Messages
6,068 (1.14/day)
System Name RemixedBeast-NX
Processor Intel Xeon E5-2690 @ 2.9Ghz (8C/16T)
Motherboard Dell Inc. 08HPGT (CPU 1)
Cooling Dell Standard
Memory 24GB ECC
Video Card(s) Gigabyte Nvidia RTX2060 6GB
Storage 2TB Samsung 860 EVO SSD//2TB WD Black HDD
Display(s) Samsung SyncMaster P2350 23in @ 1920x1080 + Dell E2013H 20 in @1600x900
Case Dell Precision T3600 Chassis
Audio Device(s) Beyerdynamic DT770 Pro 80 // Fiio E7 Amp/DAC
Power Supply 630w Dell T3600 PSU
Mouse Logitech G700s/G502
Keyboard Logitech K740
Software Linux Mint 20
Benchmark Scores Network: APs: Cisco Meraki MR32, Ubiquiti Unifi AP-AC-LR and Lite Router/Sw:Meraki MX64 MS220-8P
I use SD through a mobile app, and it won't render anime catgirls wearing leather pants, but it allows rendering of very nasty stuff that is too hot for this site (in a bad way).

It used to, but this update made it unable to.
 
Joined
May 22, 2024
Messages
411 (2.20/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
Probably the worst Stable Diffusion release to date. It's god-awful at doing anatomy, particularly females. Worse than SD1.5 bad. The licensing changes are killing the fine-tuning community to boot.

It might be good if you only do concept art, graphics, etc., but given that there won't be nearly as many fine-tunes, there won't be as many creative options either.

They mean with regard to the full 8B model. SD3 2B (this release) uses 16.3GB of VRAM without any LoRA or IP-Adapter, which means you are going to want at least a 4080 for best performance. The vast majority of Nvidia users, including those of their upcoming cards if rumors are correct, will be SOL VRAM-wise.
To be fair, the model does show composition and spelling capabilities previously unseen in any open-weight model. The rather overwrought safety engineering could theoretically be moderated or entirely reversed, even given the licensing restrictions, though that leads to other, far more complicated and much worse issues in a model capable of advanced composition, beyond the obviously worsened harm to artists' communities.

Licensing restrictions are not unlike choking hazard warnings on a pack of jelly beans: it won't actually stop people from either choking on them by accident or, for that matter, stuffing them up their noses. Even though either is probably less legally actionable. :p

Point being, Pandora's box is open as soon as the model weights are released, and it's not a good look.

There are also possible ways to reduce VRAM usage, often enabled by default on various local deployment platforms. The encoders could be unloaded, and possible sub-INT8 quantization could potentially reduce the model weight itself to a reasonable size, if they decide to release the 8B version as well.
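
To put rough numbers on the quantization point, here is a back-of-the-envelope sketch. It counts only the diffusion model's own weights (parameters × bits ÷ 8); the text encoders, VAE, and activations are deliberately left out, so real VRAM usage will be higher. The parameter counts are the 2B and 8B figures from this thread, and the 4-bit row is a hypothetical stand-in for "sub-INT8", not a format anyone has announced.

```python
# Weights-only memory estimate. Ignores text encoders, VAE, and activations,
# so these figures are a lower bound, not a measured VRAM footprint.
GIB = 1024 ** 3


def weight_gib(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


# Parameter counts as quoted in this thread (assumed exact for illustration).
MODELS = {"SD3 Medium (2B)": 2e9, "SD3 8B": 8e9}

# fp16 baseline, INT8, and a hypothetical 4-bit ("sub-INT8") format.
PRECISIONS = {"fp16": 16, "int8": 8, "int4": 4}


def estimate_table() -> dict:
    """Return {model: {precision: GiB rounded to 2 decimals}}."""
    return {
        name: {p: round(weight_gib(n, bits), 2) for p, bits in PRECISIONS.items()}
        for name, n in MODELS.items()
    }


if __name__ == "__main__":
    for name, row in estimate_table().items():
        print(name, row)
```

By this arithmetic the 8B weights alone would be about 14.9 GiB at fp16 and about 3.7 GiB at 4 bits, the same as the 2B weights at fp16, which is why quantization matters more for the larger model.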
 
Joined
May 3, 2018
Messages
2,881 (1.20/day)
OK, please explain how, after training on exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a six-legged dog. I just watched the latest Stable Diffusion AI-generated video, and nearly all the humans had glaring issues with arms/hands/legs, depending on the pose.
 
Joined
Jan 3, 2021
Messages
3,491 (2.46/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
OK, please explain how, after training on exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a six-legged dog.
How could SD possibly know that these aren't average humans and animals?
[Attached images: 1718355709191.png, 1718355766120.png, 1718355830844.png]


I just watched the latest Stable Diffusion AI-generated video, and nearly all the humans had glaring issues with arms/hands/legs, depending on the pose.
Wow. Is the number of fingers at least constant through the videos?
 
Joined
May 22, 2024
Messages
411 (2.20/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
How could SD possibly know that these aren't average humans and animals?
"photography, many-faced many-armed statue of the Hindu pantheon", "majolica, triskelion, winged angelic head", and "sable a six-legged canine regardant, breathing flame gules, rotated 7 degrees counterclockwise from horizontal" would probably be actually quite distinctive in the dataset, if present at all. It's all in the labels.

Now I'm not an expert at this, but IMHO what happened here is likely some or all of these things, and certainly something else too:

First, they filtered their dataset based on labels, and excluded even relatively tame keywords with only a remote possibility of misuse, like "lying", along with the more obviously questionable ones. This also filtered out the majority of the data associated with difficult but usually safe anatomical details, like hands.

Furthermore, they (EDIT: could have) included a poisoned dataset with labels associated with those concepts, made up of intentionally scrambled images, perhaps to impede finetuning to undo those limits. Though I wonder whether they actually went that far.

Lastly, perhaps they did not train on a large dataset to begin with, due to copyright, quality, and/or budgetary concerns, and the model overfitted.
 
Joined
Mar 21, 2016
Messages
2,508 (0.79/day)
OK, please explain how, after training on exabytes of data, Stable Diffusion is still producing humans with 4 hands and 12 fingers, or a six-legged dog. I just watched the latest Stable Diffusion AI-generated video, and nearly all the humans had glaring issues with arms/hands/legs, depending on the pose.

I'd say much of it has to do with it being trash. No, it's probably attributable in no small part to negative prompt modifiers like bad anatomy, disfigured, blurred, butt on face, etc.

If you want to train it right, with consistency, I think you'd need to start with hand, fist, grip, and claw, get consistency out of each, and then use the appropriate one for whatever your intended scene is. Additionally, left and right versions of those. I'm sure a LoRA on each of those would help, though you can still train a collection without a LoRA; it's just more difficult, with less tuning, since you can't weight the results or vary them when you need or want to blend and fuse things together the way you intend.

In the end, "draw me a perfect whatever" isn't really going to work at this point, especially for something unconventional that doesn't even have a dataset in the first place, like human-like animals and hybrids of various kinds. DALL-E 3 does a pretty decent job overall, though, and you can usually work with it to get something close to what you intended. You know, like groundhogs with sledgehammers: they be sledging hard at work in the underground, they got the Mjölnir might, they ain't no sucka whack-a-moles...
[Attached image: Designer (40).png]
 
Joined
May 3, 2018
Messages
2,881 (1.20/day)