Wednesday, March 15th 2023

OpenAI Unveils GPT-4, Claims to Outperform Humans in Certain Academic Benchmarks

We've created GPT-4, the latest milestone in OpenAI's effort in scaling up deep learning. GPT-4 is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. For example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5's score was around the bottom 10%. We've spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT, resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails.

Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload. A year ago, we trained GPT-3.5 as a first "test run" of the system. We found and fixed some bugs and improved our theoretical foundations. As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time. As we continue to focus on reliable scaling, we aim to hone our methodology to help us predict and prepare for future capabilities increasingly far in advance—something we view as critical for safety.
We are releasing GPT-4's text input capability via ChatGPT and the API (with a waitlist). To prepare the image input capability for wider availability, we're collaborating closely with a single partner to start. We're also open-sourcing OpenAI Evals, our framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in our models to help guide further improvements.
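For developers, the text capability is reached through the same chat-style endpoint ChatGPT uses. As a rough sketch (the `build_chat_request` helper and the prompts below are ours for illustration; the model name and message schema follow OpenAI's launch documentation):

```python
# Sketch of a GPT-4 chat request body as documented at launch.
# build_chat_request is an illustrative helper, not an OpenAI API call;
# the resulting payload would be POSTed to /v1/chat/completions with an API key.
def build_chat_request(user_prompt, system_prompt="You are a helpful assistant."):
    """Assemble the JSON body for a chat completion request."""
    return {
        "model": "gpt-4",
        "messages": [
            # The system message is the main steerability hook.
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.7,
    }
```

The system message is where the steerability described above is exercised in practice.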

In a casual conversation, the distinction between GPT-3.5 and GPT-4 can be subtle. The difference comes out when the complexity of the task reaches a sufficient threshold—GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5.


To understand the difference between the two models, we tested on a variety of benchmarks, including simulating exams that were originally designed for humans. We proceeded by using the most recent publicly-available tests (in the case of the Olympiads and AP free response questions) or by purchasing 2022-2023 editions of practice exams. We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.

We look forward to GPT-4 becoming a valuable tool in improving people's lives by powering many applications. There's still a lot of work to do, and we look forward to improving this model through the collective efforts of the community building on top of, exploring, and contributing to the model.
Sources: OpenAI Press Release, OpenAI YouTube Livestream

26 Comments on OpenAI Unveils GPT-4, Claims to Outperform Humans in Certain Academic Benchmarks

#1
Arco
Skynet Mark 0.4.
Posted on Reply
#2
skates
Training AI with curated data is just as corrupting as going full Leeroy Jenkins by using all the data from the internet. I'm going to coin the term "AI Drips" for when the time comes that AI is just as divided and corrupting as humans.
Posted on Reply
#3
kondamin
When I was using Bing Chat to ask about stacking SRAM yesterday, it cited a men's health website promoting vitamins or something…
Posted on Reply
#4
dgianstefani
TPU Proofreader
"refusing to go outside of guardrails"

Interesting how this is such a core focus.
Posted on Reply
#5
dragontamer5788
It is said that Bing has been using GPT-4 behind the scenes. If this is true, then I'm not very impressed (at least, with regards to Bing AI and its ability to generate meaningful search results).

Bing AI is slow, primarily summarizes the top search hits from Bing, and can contradict itself as it fails to "combine" data or thoughts from different webpages. IMO, I get more done with traditional search (i.e. Bing by itself, or Google) than waiting for ChatGPT to read the webpage, generate some text, and then create a response. There's a fair amount of waiting that happens here.

I am part of the Bing AI / Chat preview, and have tried it about 5 or 6 times in the past few weeks. I honestly think it's a very poor match for the technology at play here. I think the creative writing bots are a better use of LLMs (large language models) like GPT-3 or whatever. My preference is also to support open-source implementations, even if they run on smaller and older LLMs (GPT-2 era or so), because running on your own computer with complete freedom is far more useful than being forced to use an API with $$$$ per API call.

Don't forget, OpenAI isn't open. It's just a company trying to sell you a new SaaS.

----------

Creative writing guides / prompts seem like the best direction for this technology so far. Maybe "Copilot" (Microsoft's LLM trained on GitHub / programming) as well, but it's not like Copilot is perfect at writing code... you still need to be a very aware programmer to use it correctly.
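As a hypothetical illustration of that point (both functions below are ours, not actual Copilot output), a suggestion can look plausible while hiding a classic off-by-one that only an attentive reviewer catches:

```python
# Hypothetical example of why autocomplete-style suggestions still need review.
def last_n_suggested(items, n):
    """Plausible-looking suggestion: silently drops one element."""
    return items[-n + 1:]  # bug: off by one

def last_n_reviewed(items, n):
    """What the reviewing developer actually wanted."""
    if n <= 0:
        return []
    return items[-n:]

print(last_n_suggested([1, 2, 3, 4], 2))  # [4] -- wrong
print(last_n_reviewed([1, 2, 3, 4], 2))   # [3, 4]
```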
Posted on Reply
#6
WorringlyIndifferent
dgianstefani"refusing to go outside of guardrails"

Interesting how this is such a core focus.
You don't get ESG funding from BlackRock if your product can go against mainstream propaganda. AI is built on real world data, and real world data necessarily goes against narratives pushed by media, politicians, and most importantly: investment firms.

VC and funding from investment groups like BlackRock is the only way for many (most?) of these small tech companies to survive. That's why they lobotomize their products; do what your masters want or your funding gets cut.
Posted on Reply
#7
dragontamer5788
WorringlyIndifferentYou don't get ESG funding from BlackRock if your product can go against mainstream propaganda. AI is built on real world data, and real world data necessarily goes against narratives pushed by media, politicians, and most importantly: investment firms.

VC and funding from investment groups like BlackRock is the only way for many (most?) of these small tech companies to survive. That's why they lobotomize their products; do what your masters want or your funding gets cut.
It's not about ESG or any of that.

It's about the horror stories and outright belligerent responses that ChatGPT has produced. Existential crises, etc. etc. Things that are deeply disturbing to people and will get them to stop using OpenAI's products. There are early examples (pre-guardrails) where ChatGPT openly insults the user, threatens them with violence, etc. etc.
Posted on Reply
#8
ExcuseMeWtf

That's honestly kinda scary for its implications for jobs one would think were impervious to being replaced by technology.
Posted on Reply
#9
R-T-B
dragontamer5788It is said that Bing has been using GPT-4 behind the scenes. If this is true, then I'm not very impressed (at least, with regards to Bing AI and its ability to generate meaningful search results).
Yeah, me neither:

Posted on Reply
#10
Wye
ExcuseMeWtf

That's honestly kinda scary for its implications for jobs one would think were impervious to being replaced by technology.
ChatGPT is not scary. It's a fun little toy.

It is able to respond to queries about basic public stuff like the ones you find on Stack Overflow - it is probably trained on data from it and similar sites. Actually, I think most people would be able to find the answer faster with a Google search than by playing "robot" with conversation software. AltaVista was able to respond to human questions 25 years ago, remember?
It is more practical and faster to push the light switch than to say "Siri, please turn off the light in the master bedroom" 3 times in different ways and accents until the stupid software gets it.

That problem in the video might look impressive to someone who doesn't know any programming at all. But in fact it is what we developers call a "Hello world" REST API, something you would start with as your first step when you begin to learn programming. You can find tons of examples like that on the web with a simple Google search, and that is in fact what the "AI" did. It does not "think". Nobody would hire you for that; they would laugh in your face.

Real developer user stories will never ask you to do simple stuff like merging two lists.
Real user stories are about hundreds of private requirements and dependencies, and they involve work that was never publicly published anywhere, so the neural network will not have a clue even how to start understanding the question, let alone actually provide the solution.

ChatGPT is a fun novelty, but that's it. It will get old fast and people will move on from it.
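For context, the "merge two lists" task mentioned above really is a textbook first exercise; a minimal sketch in plain Python:

```python
# Textbook two-pointer merge of two sorted lists -- the kind of
# "hello world" exercise the comment is referring to.
def merge_sorted(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])  # append whatever remains of either input
    out.extend(b[j:])
    return out
```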
Posted on Reply
#11
ExcuseMeWtf
WyeChatGPT is not scary. It's a fun little toy.
It is able to respond to queries about basic public stuff like the ones you find on Stack Overflow - it is probably trained on data from it and similar sites. Actually, I think most people would be able to find the answer faster with a Google search than by playing "robot" with conversation software. AltaVista was able to respond to human questions 25 years ago, remember?
It is more practical and faster to push the light switch than to say "Siri, please turn off the light in the master bedroom" 3 times in different ways and accents until the stupid software gets it.

Real developer user stories will never ask you to do simple stuff like merging two lists.
Real user stories are about hundreds of private requirements and dependencies, and they involve work that was never publicly published anywhere, so the neural network will not have a clue even how to start understanding the question, let alone actually provide the solution.

ChatGPT is a fun novelty, but that's it. It will get old fast and people will move on from it.
No, he makes a plausible case for AI churning out rudimentary code, then having a senior dev review it, fix it up to standard, and add whatever the AI missed.

Obviously a senior dev with knowledge of the project cannot be replaced, but the juniors who'd otherwise be the ones churning out that code - not so much. I've worked as a coder for over a decade, and you absolutely need to review new devs' code as well if you just plug them into a more senior team - many things are done a very specific way for a reason, maybe some customer request a few years back, and the customer would lose their crap should it stop working. That's not something you can expect a fresh hire to know, even if their technical skills are well up to par.

From that standpoint, there really isn't THAT much difference between a newbie coder and AI.
Posted on Reply
#12
dragontamer5788
ExcuseMeWtfNo, he makes a plausible case for AI churning out rudimentary code, then having a senior dev review it, fix it up to standard, and add whatever the AI missed.

Obviously a senior dev with knowledge of the project cannot be replaced, but the juniors who'd otherwise be the ones churning out that code - not so much. I've worked as a coder for over a decade, and you absolutely need to review new devs' code as well if you just plug them into a more senior team - many things are done a very specific way for a reason, maybe some customer request a few years back, and the customer would lose their crap should it stop working. That's not something you can expect a fresh hire to know, even if their technical skills are well up to par.

From that standpoint, there really isn't THAT much difference between a newbie coder and AI.
Again, remember that this is OpenAI trying to sell their GPT-4 service to you, a subscription service that must go through their servers (and is impossible to run on your own computers at home). We're in the "submarine marketing" phase of OpenAI.

paulgraham.com/submarine.html

Give it a few weeks and the submarine marketing will die down, and we can start having discussions on what this thing is actually useful for. But for now, the internet is filled with marketing speak / hype / paid sponsors trying to get you to play with the tool.

--------

This hypefest and discussion / YouTube sponsors and influencers / etc. etc. is all part of the submarine marketing playbook. It makes it difficult to discuss the project because all noise almost instantly converts into hype (as per the submarine's job). I'm not necessarily saying there's "nothing useful" here... but whatever uses you're seeing are going to be hyped the living crap out of, like XML or UML or Agile Programming. You gotta wait a bit for the submarine to move on to another subject before reality settles in.

While we're in this hypefest, treat the "tool" as a toy. We don't actually know if this thing is useful. But there's a legion of marketers out there trying to convince you that this is a useful tool (and, in addition, to pay subscription fees to OpenAI for access to it).
Posted on Reply
#13
ExcuseMeWtf
Obviously this is a matter of the company gauging whether they save money by switching to such a model of code production compared to just hiring more devs.

Am I saying software dev will go extinct? Again, obv not.

Can I see companies trying this out over few years to see? Definitely.

Will they decide to stick with it? If I knew, I'd play the lottery too.

Limiting your outlook to "oh, it's just a toy" is about as short-sighted as saying nobody will ever need more than 640K of RAM.
Posted on Reply
#14
trsttte
WyeReal developer user stories will never ask you to do simple stuff like merging two lists.
Real user stories are about hundreds of private requirements and dependencies, and they involve work that was never publicly published anywhere, so the neural network will not have a clue even how to start understanding the question, let alone actually provide the solution.

ChatGPT is a fun novelty, but that's it. It will get old fast and people will move on from it.
Obviously it would be much more useful if I could unleash it on my badly maintained code base that doesn't even have a simple static analyser doing its thing, but I still find it useful for kickstarting and helping me through new topics if I massage the queries properly.

It depends on what you're doing and what you want from it. In a well-organized team and project where different topics don't come out of nowhere all the time, it's harder to get any value out of it, but when there's a new fire to put out every time, it's nice to have the bot churn out the basics instead of always relearning every language/tool/whatever you happen to need that specific week.

Sometimes it's faster to get the answer from the bot than to go hunt for it in the documentation (and you even get a small proof of concept, or learn whether the thing you want is possible); sometimes it's not, but for now at least, while it's a novelty, I still go to the bot first.
Posted on Reply
#15
dragontamer5788
ExcuseMeWtfLimiting your outlook to "oh, it's just a toy" is about as short-sighted as saying nobody will ever need more than 640K of RAM.
GPT-4 is a toy because it's got no integration with modern programming tools.

If you want to see an actual, usable product, give GitHub Copilot + Visual Studio a shot. docs.github.com/en/copilot/getting-started-with-github-copilot/getting-started-with-github-copilot-in-visual-studio

EDIT: Similarly, it's got no integration with creative writing tools either. It's this weird chat-bot that talks back to you when you talk to it. A creative writing tool needs to fill in paragraphs before and after your written words to help brainstorm ideas. GPT-4 as an LLM is impressive, but it needs a fair amount of massaging before it's actually a usable tool. OpenAI is hoping that people will pay $$$$ per API call to build tools on top of GPT-4. That's it.
Posted on Reply
#16
caroline!
It's of little to no use if they insist on restricting it so as not to offend the clowns on the internet.

OpenAI is anything but open.
Posted on Reply
#17
R-T-B
caroline!It's of little to no use if they insist on restricting it so as not to offend the clowns on the internet.

OpenAI is anything but open.
Yes, we should have more racist chatbots trained by the unfiltered internet. Sounds like a great idea.

inb4 they all join 4chan and become addicted to porn.
Posted on Reply
#18
dragontamer5788
R-T-BYes, we should have more racist chatbots trained by the unfiltered internet. Sounds like a great idea.

inb4 they all join 4chan and become addicted to porn.
There are plenty of published, confirmed stories about ChatGPT insulting users and otherwise giving very bad user experiences.

I get that some online conspiracy writers want to pretend that there's a giant conspiracy against their political views. But these "guardrails" are anything but political. It's about basic common sense. You can't have your chatbot AI insulting the userbase.



That OpenAI would see these prompts, confirm their existence, and try to clamp down on this behavior is just common sense.
Posted on Reply
#19
TheinsanegamerN
R-T-BYes, we should have more racist chatbots trained by the unfiltered internet. Sounds like a great idea.

inb4 they all join 4chan and become addicted to porn.
Oh no, a naughty word was said, better kneecap all of our technology so my sensibilities don't get offended!
dragontamer5788There are plenty of published, confirmed stories about ChatGPT insulting users and otherwise giving very bad user experiences.

I get that some online conspiracy writers want to pretend that there's a giant conspiracy against their political views. But these "guardrails" are anything but political. It's about basic common sense. You can't have your chatbot AI insulting the userbase.



That OpenAI would see these prompts, confirm their existence, and try to clamp down on this behavior is just common sense.
Yes, ignore the curtain and the man behind it, just get excited for product!

Also, we never said "you will own nothing and you'll be happy", we said "you will own nothing and you'll be happy, and that's a good thing". Totally different statement!

Remember, ChatGPT being willing to make jokes about men of a certain race but no other race or gender has NOTHING to do with ESG scores or political motives, we swear!
Posted on Reply
#20
dragontamer5788
TheinsanegamerNRemember, ChatGPT being willing to make jokes about men of a certain race but no other race or gender has NOTHING to do with ESG scores or political motives, we swear!
I'm beginning to find the kinds of Bing Chat experiences that were making people upset.

bing/comments/110eagl
And yeah... I don't think politics has anything to do with it.

-----------

As far as I can tell, this thing was trained on Reddit posts. After some number of posts (maybe 8 or 9?), Reddit threads start to grow unhinged and unhelpful. The typical Redditor is happy to converse with you at first, but they get angry as you bug them over and over across the next few posts.

ChatGPT / Bing AI seems to have picked up this behavior. The longer you converse with it, the worse and more unhinged / unhelpful it gets. Today, Bing AI limits the discussion length to cut these "later responses" out. But it's a known failure case of Bing AI / ChatGPT.

-------

Anyway, despite it being 2023, Bing AI was trained on data from 2022, so Bing AI / ChatGPT believed the year was 2022. And when informed that it was 2023, it seemingly got upset and set off on an angry, unproductive rant. Being able to control the chatbot, with "guardrails" on the discussion to keep it useful and within the parameters Microsoft / Bing wants, is just common sense. These kinds of question/answer sessions are unhelpful and useless.
Posted on Reply
#21
trsttte
R-T-BYes, we should have more racist chatbots trained by the unfiltered internet. Sounds like a great idea.

inb4 they all join 4chan and become addicted to porn.
Godwin's law speed run
Posted on Reply
#22
R-T-B
TheinsanegamerNOh no, a naughty word was said, better kneecap all of our technology so my sensibilities don't get offended!
Yes. AI and robots are servants. They need to know how to interact with people.
trsttteGodwin's law speed run
I didn't even mention Nazis, but I'm sure the chatbot would have by now if unregulated content were fed to it.

EDIT: I think I misinterpreted your post, my bad.
Posted on Reply
#23
caroline!
R-T-BYes, we should have more racist chatbots trained by the unfiltered internet. Sounds like a great idea.
Yep.
Bots are only logical; they make "scientists" and their programmers mad by stating facts or quoting statistics.
Posted on Reply
#24
R-T-B
caroline!Yep.
Bots are only logical; they make "scientists" and their programmers mad by stating facts or quoting statistics.
Except their "statistics" come from randomly searching the internet and its very biased human data sources. You can't possibly believe with a straight face that that's a smart, unbiased dataset.

Should I boil a baby? "Yes, it's harmless to you and good for the world. Source: three dead-baby jokes." That is neither reasonable nor what I was asking.

So nope.
Posted on Reply
#25
WorringlyIndifferent
It's always bizarre to see people defending, or outright praising, censorship. Gotta control that flow of information. Can't have bad thoughts reach the eyes of the public. Need to have those thoughts policed by, let's see... ah, a private corporation with billions of dollars in backing. Surely that is the paragon of morality we need in control of technology and the internet.
Posted on Reply