In case you missed it the first two nights, tonight is the final human vs. machine match on Jeopardy! Watson, a question-answering AI three years in the making, has been pitted against two of the best human players of all time. And it is cleaning up.
As an AI guy, I feel a little like I’m watching the moon landing, and a little like somebody is showing embarrassing home movies.
First, the moon landing part. This is amazing stuff here, people. Do not be jaded by Google. There is a tremendous difference between retrieving something very related to your question, and actually answering a question. (Or in this case, posing the question; but even IBM calls the project “DeepQA,” for “Question-answering.”) Sentences are extraordinarily tricky, supple, varied things, and AI that attempts to understand natural language sentences using parse trees and deterministic rules usually falls flat on its face. The difference between “man bites dog” and “dog bites man” isn’t captured in a lot of search retrieval algorithms, but when Watson must understand a phrase such as “its largest airport is named for a World War II hero; its second largest, for a World War II battle”—no Google search for “world war II airport” is going to suffice. (Try it.)
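As a trivial illustration of that difference (my own toy example, nothing to do with IBM's actual code), a bag-of-words representation, which is roughly what naive keyword retrieval sees, literally cannot tell those two sentences apart, while even a crude subject-verb-object reading can:

```python
from collections import Counter

s1, s2 = "man bites dog", "dog bites man"

# Bag-of-words view: identical word counts, so the two sentences look
# identical to a naive retrieval system.
print(Counter(s1.split()) == Counter(s2.split()))   # True

# Even a crude subject-verb-object reading keeps them distinct.
def crude_svo(sentence):
    subject, verb, obj = sentence.split()
    return {"subject": subject, "verb": verb, "object": obj}

print(crude_svo(s1) == crude_svo(s2))                # False
```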
In the cases where Watson has fallen down, as with the airport clue above, I think it’s generally been because of a failure to parse, or its version thereof; but it’s been remarkably resilient against extremely tricky phrasings. The first night, I was blown away by its answer to the Daily Double. The category was “Literary APB,” and the clue was what seemed an extremely sideways reference to Mr. Hyde: “Wanted for killing Sir Danvers Carew; appearance—pale & dwarfish; seems to have a split personality.” This is the kind of thing that can give natural language processing (NLP) researchers fits if they’re trying to write code that parses the sentence.
What I didn’t notice the first time I saw the clue, though, was that “Sir Danvers Carew” was a dead giveaway to a machine with huge databases of text associations at its digital fingertips. The name would be likely to point to other things in the classic book with extremely high confidence, simply because it commonly appears near them in text. Of course, the machine must still understand that the correct answer is “Hyde” and not the book title or author or place—so its answer was still extremely impressive.
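To make the "text associations" idea concrete, here's a toy sketch (not Watson's actual pipeline, whose internals I don't know) that scores candidate answers by how often they co-occur with a distinctive phrase from the clue; the tiny corpus and the candidate list are made up for illustration:

```python
from collections import Counter

# Toy corpus standing in for Watson's massive text collection (illustrative only).
corpus = [
    "Mr. Hyde trampled a girl before he murdered Sir Danvers Carew with a cane.",
    "Sir Danvers Carew, a member of Parliament, is killed by Edward Hyde.",
    "Dr. Jekyll confesses that he and Hyde share a single, split personality.",
    "Robert Louis Stevenson wrote the novella about Jekyll and Hyde.",
]

def association_score(candidate, anchor_phrase, texts):
    """Count how often a candidate answer appears in the same passage
    as a distinctive phrase lifted from the clue."""
    return sum(
        1 for t in texts
        if anchor_phrase.lower() in t.lower() and candidate.lower() in t.lower()
    )

candidates = ["Hyde", "Jekyll", "Stevenson", "Utterson"]
anchor = "Sir Danvers Carew"

scores = Counter({c: association_score(c, anchor, corpus) for c in candidates})
print(scores.most_common())  # "Hyde" co-occurs with the anchor most often
```

Of course, co-occurrence alone would also rank the book's title and author highly; some notion of what type of answer the clue wants is still needed, which is part of what makes Watson's answer impressive.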
But the second night was on the whole less exciting than the first, precisely because there were fewer sideways references like this, and more “keyword” type answers. A whole category was devoted to providing the common name for an obscure medical term or its symptoms—easy for Watson, because its starting point for its searches is likely to be the most specific words in the clue. The Beatles lyrics category in the first round was like this—every time a human chose it, I yelled at the screen, “Don’t do it! It’s a trap!” Still, even in this kind of clue, I was amazed at Watson’s breadth of phrase knowledge—the most remarkable being its knowing that “Isn’t that special” was a favorite saying of The Church Lady.
Okay, but about the embarrassing home movies. As much as we AI researchers are making fantastic progress in solving real problems in artificial cognition, we are fundamentally still too ready to hype, and believe our own hype. Watching the IBM infomercials the second night, which promised revolutions in medical science, prompted a mental montage of overoptimistic “Future Work” sections of papers and “Broader Impacts” sections of NSF grants. It’s how the work often gets funded, this maybe-you-could-use-this-to-save-babies kind of argument, but in many cases, it just seems like so much hot air. For one thing, the kinds of statistical reasoning that Watson presumably uses, called Bayesian networks, have been applied to medical diagnosis for quite a while, at least in academic work. What Watson really seems to be about is the same thing the chess playing Deep Blue was about—namely, raising the prestige of a technology consulting company.
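For readers who haven't seen one, here's a minimal, made-up example of the kind of reasoning a diagnostic Bayesian network does: a single disease node and a single symptom node, with invented probabilities, combined with Bayes' rule. Real diagnostic networks have many more nodes and edges, and I'm not claiming this is how Watson itself is built.

```python
# Toy two-node Bayesian network: Disease -> Symptom (all numbers invented).
p_disease = 0.01                 # prior P(disease)
p_symptom_given_disease = 0.90   # P(symptom | disease)
p_symptom_given_healthy = 0.05   # P(symptom | no disease)

# Bayes' rule: P(disease | symptom) =
#   P(symptom | disease) * P(disease) / P(symptom)
p_symptom = (p_symptom_given_disease * p_disease
             + p_symptom_given_healthy * (1 - p_disease))
p_disease_given_symptom = p_symptom_given_disease * p_disease / p_symptom

print(f"P(disease | symptom) = {p_disease_given_symptom:.3f}")  # ~0.154
```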
And then there was the little matter that, shortly after the “we could use this for medicine” argument, Watson responded to the U.S. Cities question with “What is Toronto??????” This kind of thing is why AI people always show videos instead of doing live demos. It worked in testing, we swear! But it’s extremely difficult to catch this kind of thing beforehand in machine learning, precisely because the learner ultimately acquires more complexity than we put in.
Watson’s successes and failures both point to the fact that it was ultimately engineered by people. For example, the first night, when Ken Jennings got a question wrong, Watson acted as if it hadn’t heard the answer and just repeated it. I’m told the IBM team’s reaction was simply surprise that Ken Jennings would ever get something wrong; they hadn’t counted on the possibility. It’s that brittleness that reminds us that Watson is ultimately a human triumph—it’s not a machine that’s up there, it’s a team of quite a few researchers pulling all-nighters in order to make something truly awesome. And in that way, it is like a moon landing.
The overall winner is apparently being determined by the sum of the dollar amounts of the two games—which is maybe too bad, because Watson’s carefully engineered bet-deciding mechanism now seems like it will go to waste. (Watson’s bets seem weirdly specific just because it’s presumably optimizing an expected payoff equation, one which may put different weights on winning versus winning more.) It seems unlikely that the humans will pull out an upset tonight if the questions are as keyphrase-able as the medical and Beatles categories of the previous nights. But who knows? Perhaps the producers have picked some questions that will require some tricky understanding of the sentences. Whatever Watson’s underlying algorithm, it still seems clear that it’s sometimes not actually understanding what the question is asking, but “going with its gut.” But more often than not, I’m extremely impressed with how well it handles the crazy sentence structures of Jeopardy! clues.
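Back to those weirdly specific wagers for a moment: here's a toy sketch of why an expected-payoff formulation produces oddly precise numbers. The utility function, weights, and scores are all invented for illustration and are not IBM's actual wagering model.

```python
# Toy Daily Double wager chooser (illustrative only; the weights, scores,
# and utility function are all invented, not IBM's actual model).
def expected_utility(wager, score, opponent_score, p_correct, win_weight=50.0):
    """Value final dollars, but put a much larger weight on ending the
    clue still ahead of the opponent ("winning vs. winning more")."""
    def utility(final_score):
        ahead_bonus = win_weight if final_score > opponent_score else 0.0
        return final_score / 1000.0 + ahead_bonus
    return (p_correct * utility(score + wager)
            + (1 - p_correct) * utility(score - wager))

def best_wager(score, opponent_score, p_correct):
    wagers = range(0, score + 1, 100)  # consider wagers in $100 steps
    return max(wagers, key=lambda w: expected_utility(w, score, opponent_score, p_correct))

# Leading $30,000 to $18,000 but only 60% confident: the optimizer picks
# an oddly specific wager ($11,900) that keeps it ahead even if it's wrong.
print(best_wager(30000, 18000, 0.60))
```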
What’s hard for Watson is easy for us, and vice versa; but what’s hard or easy for Watson was surely hard for its team, and they deserve the highest kudos for this remarkable accomplishment.
Kevin Gold is an Assistant Professor in the Department of Interactive Games and Media at RIT. He received his Ph.D. in Computer Science from Yale University in 2008, and his B.A. from Harvard in 2001. When he is not thinking up new ideas for his research, he enjoys reading really good novels, playing geeky games, listening to funny, clever music, and reading the webcomics xkcd and Dresden Codak.
Yay! I’ve been watching the videos on my lunch break and am riveted. If you haven’t seen them yet, they’re all available from youtube user Rashad8821.
I heard about the Jeopardy! matchup while watching the documentary several days before on PBS. It sounded intriguing, so my family decided to watch.
I agree that the computer is “cleaning up” despite its flaws (how the heck did it come up with Toronto in a “U.S. Cities” category, anyway?), and it will definitely take some fiendishly cleverly worded questions tonight if the humans are to have any hope of catching up.
Bzzz™
By being programmed by Americans, maybe?
In effect, the machine showed the same level of geographic illiteracy as most Americans of my acquaintance.
Not that I expect it was a deliberate feature.
Watson is an incredible achievement, and a big step. But Jeopardy! is a flawed test, since it’s so dependent on who buzzes first.
According to Wikipedia, there are a total of eight Torontos in the US. It probably gave weight to a Toronto/US connection from that fact, then incorrectly cross-connected that to a clue that could be misinterpreted to point to Toronto, Ontario, Canada.
Agreed – the computer seemed to hold a definite advantage in buzzing in. I think its other major advantage was the fact that the questions were sent to it in a text file. I think it would be fairer to the human players, and a better test of the system, if it had to use voice recognition to determine the question.
@7 Given my experience with speech recognition, it’s probably true that Watson would have performed worse if it had to do speech rec. However, a colleague of mine pointed out that if Watson had to sense things, it wouldn’t bother with speech — it would use OCR by reading the screen, which is what the contestants would do too. This would be basically done with perfect accuracy, and would, as my colleague put it, “be a whole lot of programming with no point.”
@8 – that’s a fair point. I don’t have a lot of experience with OCR. If Watson’s “eye” was a camera sitting at the contestant’s podium, pointed in the direction of the screen, is there an OCR program that could capture the text on screen perfectly?
Actually, whether or not OCR works that well isn’t too important; the real question is whether it would slow down the computer as opposed to a text file sent to the machine while Alex was still reading the question (I think – not sure of the timing). I mean, obviously Watson was able to parse the text and get an answer from (I’m assuming) his index as fast as human competitors, so voice or OCR reading wouldn’t slow it down too much. I’m really wondering if the method of asking Watson the questions gave him a half second (or maybe 2-3 seconds) of a head start…
Kevin – is Watson basically a search engine with better English skills?
What a great post.
@10 It’s a good point that doing any kind of sensing might slow Watson down a little … but I imagine, not much. It’s certainly not as hard as the crazy searches it’s doing.
The search engine characterization is apt insofar as it’s searching its own database for information and needs a fast way of accessing everything. But there’s also a fair amount of reasoning that it’s doing in addition to the English skills. For example, they’ve added some temporal and spatial reasoning, so that it can figure out that 400 years before 1898 is 1498, and that being within a state is also within a country, and so forth. It’s also actually been programmed to handle a fair amount of wordplay, such as the “before and after” categories where there are two answers linked by a word.
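Just to give a flavor of that extra reasoning (this is my own toy sketch, not IBM's code), the temporal part is simple arithmetic once the clue has been parsed, and the spatial part is a transitive "contained-in" lookup:

```python
# Toy temporal and spatial reasoning (my own illustration, not IBM's code).
def years_before(reference_year, offset):
    """'400 years before 1898' -> 1498, once the clue has been parsed."""
    return reference_year - offset

# Transitive containment: anything in Illinois is also in the United States.
contained_in = {"Chicago": "Illinois", "Illinois": "United States"}

def is_within(place, region):
    while place in contained_in:
        place = contained_in[place]
        if place == region:
            return True
    return False

print(years_before(1898, 400))                 # 1498
print(is_within("Chicago", "United States"))   # True
```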
But I’d say the better English skills are most of the point from an AI point of view, and pretty darn impressive.
@11 Hey, thanks!
OCR shouldn’t help: reading is faster than listening to Alex Trebek, and often his answer/question can be guessed before he finishes, but not perfectly. This is seen when the button is pressed too early, and the reset delay is long enough to allow another contestant to jump the first contestant.
To push the Moon landing analogy a little farther, humans going into space hasn’t turned out to be the most valuable part of space exploration. At least for now, the big practical win was satellites looking at the earth, and communications satellites.
So, what spin-offs might come from Watson which aren’t attempts to emulate human abilities?
If the analogy is exact, could some of the best results be better abilities to look at ourselves?
https://docs.google.com/document/pub?id=1FEy86d9ONhHWGYvdtGqBvTwIgnhm_bfXG3pMfXgoBAg&pli=1
Peripherally related, but there’s a description of how much communication in gaming is improved if players have information about each other’s heart rates.
Is anyone else annoyed at the inane comments the news media keep making about this? Their recaps keep entirely missing the point.
@kevin:
As far as I know, Watson didn’t hear the answer – no speech recognition going on. So what really happened is that Watson happened to come up with the same answer as Ken. There was no immediate feedback to stop Watson from giving the answer.
This was indeed way cool. I was also amazed at how well Watson answered the “tricky” questions (like the Hyde one). There is some very interesting programming under the covers here.
I found it interesting that the second and third choice answers were often quite a bit off, even for questions where there might be fairly reasonable second choice answers.
Ken Jennings has a great article in Slate today. Very humble and funny, and gives some behind the scenes insights into the whole thing. Worth a read.
Watching a brief clip of the show, I have to wonder if Watson wasn’t limited in some way to make it more interesting. It seems to me that Watson could use its superior reflexes to ring in first for every question and then take the 1-2 seconds allotted to come up with an optimal answer, rather than waiting for reasonable certainty before ringing in. Note that Ken Jennings seems to be using this tactic against Watson.
The way the buzzing on Jeopardy works is that no one can buzz in until Alex finishes saying the answer. The way this works for humans is that Alex finishes talking, then a Jeopardy staff member presses a button that activates the buzzers and turns on some white lights to show the contestants that the buzzers are active. Since Watson can’t see or hear, when the Jeopardy staff member activates the lights, a signal is also sent to Watson, informing him (I keep anthropomorphising Watson in my mind). Watson rings in when he gets the signal and has a sufficient certainty level.
If a contestant rings in prior to the allowed time, their buzzer is deactivated for about a fifth of a second. Timing is everything.
It appears that Watson usually has plenty of time to figure out the question as it receives the answer textually. Then, Watson is ready to buzz in as soon as it receives the ready-to-buzz signal. This is a little bit of an advantage, as the players get the input visually and/or aurally and have to filter that and then react.
The computer was indeed using speech recognition. It was, in fact, a major part of the time necessary to set up this challenge. A documentary spent several minutes referring to this aspect of the programming, while never once mentioning anything related to OCR. As substantiating evidence, the reference to the IBM team’s surprise at Watson repeating Jennings’ answer was not about the computer saying what it had already heard, but that Ken got an answer wrong. All of the commentary in that thread presumes the condition that Watson hears the “answers” and the other contestants’ “questions”.
Speaking of trivia, who else knew that one of Ken Jennings’ college roommates at BYU was Brandon Sanderson?
On the US Cities/Toronto thing, according to an Ars Technica article, the programmers just said that they don’t trust category names very much. I think that they’re afraid that too many of them are jokes or puns that Watson can’t interpret, and they said that while they do give them some weight in answering the questions, they don’t give them much weight.
They said that if that question and category were rephrased as, “This US city’s largest airport is named for a World War II hero; its second largest, for a World War II battle,” it comes up with Chicago, the correct answer.
IBM claims that Watson has no speed advantage on buzzing in, but I don’t really find that convincing. Their argument is that since humans anticipate the light and start pressing the button fractions of seconds before they think the light will change, it’s okay for Watson to use machine-level reflexes. Which… meh.
Kevin Gold: are you by any chance the Kevin Gold that I used to play D&D with in the SF Bay Area?
Personally, I would have rather seen Watson as bachelor #2 on The Dating Game.
Freelancer @22: Nope, Watson doesn’t see or hear. There is natural language processing going on in processing the question text. No hearing or visual means involved, however.
Here’s a link that is useful:
ibmresearch blog
@23 In fact I am! Small world.
I knew that the system did not put much weight on the categories, but I hadn’t heard the bit about the rephrasing leading to the correct solution. Interesting!
@22 Watson was fed text during the actual competition, but I too seem to recall that at one point part of the team was working on speech recognition. Given how difficult speech rec is when the words aren’t highly constrained by context, it is possible that this aspect was originally planned but scrapped due to poor performance. Anyway, they could have attempted to throw in speech rec for the contestants’ responses, but they chose not to; you can bet they would have added this input if they thought it would meaningfully contribute to Watson’s game.
I’d like to point out a couple of factors that seem to have produced an unfair contest, based on the facts presented here, on the shows, and in various other IBM-supported coverage:
1) Watson received the answers in text form. It had additional parsing to do, but with 2,880 CPU cores available, I expect it began running its thousands of parallel searches within mere milliseconds. Humans take quite a while just to recognize words and begin doing the real work – at least a couple of seconds for Jeopardy answers, I would guess. I believe we also allocate part of our minds to listening for hints in the way Trebek reads the answers. These differences in timing and processing are critical in game shows, but not so much in real life. (To be fair, I have noticed that when giving fundraising presentations to venture capital firms, direct and accurate split-second responses to silly questions are worth a lot…)
2) Watson received some kind of electronic notification when the Jeopardy scoring system was ready to accept a buzz-in. Presumably when Watson has already identified a response scoring above its pre-set threshold, the delay between the notification and Watson’s buzz is no more than a few milliseconds. Not only is this interval around two orders of magnitude faster than a human can push a button after seeing a light (200-300 milliseconds), it’s substantially shorter than the _uncertainty_ in the human delay from a visual stimulus to a physical response (>30 ms, I think). Last but not least, it completely eliminates any chance that Watson might buzz in too early and get locked out for the reported 200 ms interval. These are HUGE advantages.
It was clear, over and over again, that the human players were trying to buzz in at appreciably the same instant as Watson, but just couldn’t overcome Watson’s timing advantage.
The way this boils down for me is that the humans were doing MORE work (OCR and voice recognition plus the common requirements for pattern matching, analysis, ranking, etc.), had LESS time to do it, and had a 10:1 disadvantage in buzzing in on top of all that.
I would very much like to see a rematch in which Watson is required to play with OCR and voice recognition, and be limited (as humans are) to the combination of these two inputs as its only means of anticipating when the buzzers will go live.
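A back-of-the-envelope simulation using the reaction-time numbers above (plus guessed values for the human anticipation strategy and for Watson's signal-to-buzz delay, since I don't have the real distributions) shows how lopsided that buzzer race is:

```python
import random

LOCKOUT_MS = 200  # reported lockout after buzzing early

def human_buzz_time(anticipation_ms=0, jitter_ms=40):
    """Human tries to anticipate the 'buzzers live' light; reaction time
    ~250 ms with jitter. A negative result means an early buzz (lockout)."""
    return random.gauss(250 - anticipation_ms, jitter_ms)

def watson_buzz_time():
    """Watson gets an electronic ready signal and buzzes within a few ms."""
    return random.uniform(5, 10)

def human_wins(trials=100_000, anticipation_ms=240):
    wins = 0
    for _ in range(trials):
        h = human_buzz_time(anticipation_ms)
        if h < 0:                      # buzzed before the light: locked out
            h = LOCKOUT_MS + 250       # roughly: wait out lockout, react again
        if h < watson_buzz_time():
            wins += 1
    return wins / trials

print(f"Human wins the buzz ~{human_wins():.1%} of the time")
```

Even with an aggressive (and risky) anticipation strategy, the human side loses the overwhelming majority of buzz races in this crude model, which matches what we saw on screen.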
(Hi to Kevin Standlee. :-)