Evaluating GPT-3 and GPT-4 on the Winograd Schema Challenge (Reasoning Test)

Denis Kazakov
3 min read · Mar 16, 2023

Just a little fun benchmarking the new ChatGPT models on confusing sentences.

Result

I found that GPT-4 significantly outperforms GPT-3 on the Winograd Schema Challenge (the WSC285 set). Specifically:

  • GPT-4 reached an accuracy of 94.4%.
  • GPT-3 reached 68.8%. *
  • The random baseline is 50% (each question has exactly two options).

Super impressive!
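
To get a rough sense of how solid that gap is, here is a minimal sketch that puts 95% Wilson score intervals around both accuracies. It assumes all 285 items were scored for each model and back-computes the number of correct answers from the reported percentages; those counts are my inference, not figures taken from the repo.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


TOTAL = 285  # number of items in WSC285

# Assumption: counts are reconstructed from the reported accuracies.
counts = {
    "GPT-4": round(0.944 * TOTAL),
    "GPT-3": round(0.688 * TOTAL),
}

intervals = {model: wilson_interval(c, TOTAL) for model, c in counts.items()}

for model, correct in counts.items():
    lo, hi = intervals[model]
    print(f"{model}: {correct}/{TOTAL} = {correct / TOTAL:.1%}, "
          f"95% CI [{lo:.1%}, {hi:.1%}]")
```

If the two intervals do not overlap and both sit above 50%, the difference is unlikely to be noise from the sample size alone. This says nothing about whether the models saw the test set during training.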

All the code/answers are in https://github.com/d-kz/gpt_winograd/blob/master/all.csv

Winograd Challenge

The Winograd Schema Challenge is a task used to evaluate natural language processing models. Each item is an ambiguous sentence that is tricky to resolve without general knowledge of how the world works and the ability to apply that knowledge to the sentence.

For example,

“The man couldn’t lift his son because he was so weak.” In ‘he was so weak’, does ‘he’ refer to ‘the man’ or ‘the son’?

  • We know it can be difficult to lift somebody up if you are weak. Since ‘the man’ is lifting his son, we can assume it’s ‘the man’ that’s weak and not the son.

“Dan took the rear seat while Bill claimed the front because his “Dibs!” was slow.” Whose ‘dibs’ was slow, Bill’s or Dan’s?

  • To know that Dan was too slow with his Dibs, we need to know that front seats are the desirable ones. Otherwise, we can’t know who was too slow.

To reason like this, the model needs to both:

  1. have a good understanding of the world and
  2. be able to relate that understanding to the context it is presented with.

Prompting process

  • I used the ChatGPT UI to feed the GPT models data in batches of 50 rows, using this prompt:
You will receive rows of data. Each column is separated by ';' symbol. Columns are as following:"text";"pronoun";"quote";"options". 

Your job is to answer each row of data with the following question. What does "pronoun" in "quote" refer to in the "text", given "options"? Choose your answer from "options".

Pay particular attention to ambiguities and try to infer the answer using your knowledge of how the world works to get the right answer. Think it over three times before giving your answer.

Make sure your output is only one of the "options". Provide your answers as a list. don't repeat the question, only give answers. I repeat, do not repeat the question or give your reasoning, just give the answer.
  • The context had to be reset after each batch (by starting a new conversation) to avoid degradation.
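
The batching step can be sketched roughly as follows. This is an illustration rather than the code from the repo: the function names, the dictionary keys, and the way the two options are written inside the "options" column are all assumptions. Only the first paragraph of the prompt is included, so paste in the rest for a real run.

```python
from typing import Iterator

BATCH_SIZE = 50  # batch size mentioned in the article

# First paragraph of the prompt only; the remaining paragraphs would follow.
INSTRUCTIONS = (
    "You will receive rows of data. Each column is separated by ';' symbol. "
    'Columns are as following:"text";"pronoun";"quote";"options".'
)

Row = dict[str, str]


def format_row(row: Row) -> str:
    """Render one example as a ';'-separated line of quoted fields."""
    fields = (row["text"], row["pronoun"], row["quote"], row["options"])
    return ";".join(f'"{field}"' for field in fields)


def batches(rows: list[Row], size: int = BATCH_SIZE) -> Iterator[list[Row]]:
    """Yield consecutive chunks of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_prompts(rows: list[Row]) -> list[str]:
    """One prompt per batch; each is meant for a fresh conversation."""
    return [
        INSTRUCTIONS + "\n\n" + "\n".join(format_row(r) for r in batch)
        for batch in batches(rows)
    ]


# Hypothetical row; the "options" formatting is an assumption.
example = {
    "text": "The man couldn't lift his son because he was so weak.",
    "pronoun": "he",
    "quote": "he was so weak",
    "options": "the man, the son",
}

prompts = build_prompts([example] * 120)
print(len(prompts))  # 120 rows -> batches of 50, 50 and 20
```

Each string in `prompts` would be pasted into its own new conversation, which matches the context reset described above.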

Assumptions

GPT models are trained on a lot of data, so we can only assume they didn’t cheat by simply reciting memorized answers to the Winograd challenge. Judging by the reasoning they gave when asked to explain themselves, it seems they didn’t.

Feel free to replicate this evaluation with a completely new Winograd challenge set (i.e. one the GPT models haven’t seen), but for now, you can just take GPT’s word for it?!:)
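
If you do try a fresh set, scoring could look something like the sketch below. The normalization rules (lowercasing, trimming punctuation, dropping a leading article) are my own assumption about what counts as a match, not necessarily how the answers in the repo were graded.

```python
def normalize(answer: str) -> str:
    """Lowercase, trim quotes/periods, and drop a leading article."""
    cleaned = answer.strip().strip(".\"'").lower()
    for article in ("the ", "an ", "a "):
        if cleaned.startswith(article):
            return cleaned[len(article):]
    return cleaned


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold answer after normalization."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold answers must have the same length")
    hits = sum(normalize(p) == normalize(g) for p, g in zip(predictions, gold))
    return hits / len(gold)


# Tiny illustrative example using the two sentences from this post.
model_answers = ["The man.", "Bill"]
gold_answers = ["the man", "Dan"]
print(f"accuracy: {accuracy(model_answers, gold_answers):.0%}")
```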

Switching the gender on the question still gets the right answer.

“Explain yourself step-by-step”

Prompting the models to explain their logic made GPT-4 correct its mistake, while GPT-3 remained firm with its initial decision.

GPT-4 corrects itself (LEFT), GPT-3 fails to correct itself (RIGHT)
