How does Google use human raters in web search?

MATT CUTTS: Hey, everybody.
Matt Cutts here, ready to answer another question
you’ve got about how Google search works. We’ve got a really interesting
one today. It’s from San Francisco, California. “Can you provide more details on how Google uses human raters as
part of their algorithm?” Great question. I’m going to try to narrow it
down a little bit first, so by human raters I assume you mean
people who are paid by Google, that is you’re not talking about
people who are blocking results in the Google search
results or using the Chrome extension to block things. You’re actually talking about
people who are rating results. I’m also going to assume
that you don’t mean people doing web spam. So I’ve made other videos that
talk about how Google takes action and is willing to take
manual action on web spam, but you’re talking about raters. You used the word raters, so
let me drill down on that a little bit. Raters are really not used to
influence Google’s rankings directly, so let’s walk
through exactly how they are used. I’m not a member of the Search
Quality Evaluation Team. I work on web spam, but I can
basically paraphrase the process because that’s
where the human raters come into play. Suppose an engineer
has a new idea. They’re thinking, oh, I can
score these names differently if I reverse their order because
in Hungarian and Japanese that’s the sort of
thing where that can improve search quality. What you would do is we have
rated a large quantity of URLs, and we’ve said this
is really good. This is bad. This URL is spam. So there are hundreds of raters who
are paid to, given a URL, say is this good stuff? Is this bad stuff? Is it spam? How useful is it? Those sorts of things. Is it really, really
just essential, all those kinds of things. We also– so once you’ve gotten all those
ratings, your engineer has an idea. He says “OK, I’m going to change
the algorithm.” He changes the algorithm and does
a test on his machine or here at the internal corporate
network, and then you can run a whole bunch of different
queries. And you can say OK, what
results change? And you take the results that
change and you take the ratings for those results and
then you say, overall, do the results that
are returned tend to be better, right? They’re the sort of things that
people rated a little bit higher rather than a
little bit lower? And if so, then that’s
a good sign, right? You’re on the right path. It doesn’t mean that it’s
perfect, like, raters might miss some spam or raters might
not notice some things, but in general you would hope that if
an algorithm makes a new site come up, then that new site
would tend to be higher rated than the previous site
that came up. So imagine that everything
looks good. It looks like it’s a
pretty useful idea. Then the engineer, instead of
just doing some internal testing, is ready to go through
sort of a launch evaluation where they say
how useful is this? And what they can do is they can
generate what’s called a side by side. And the side by side is exactly
what it sounds like. It’s a blind taste test. So
over here on the left-hand side, you’d have one set
of search results. And on the right-hand side
you’d have a completely different set of
search results. So one, two, three, four, five,
six, seven, eight, nine, ten, one, two, three, four,
five, six, seven, eight, nine, ten. And if you’re a rater, that is
a human rater, you would be presented with a query and
a set of search results. And given the query, what you
do is you say, “I prefer the left side,” or “I prefer the
right side.” And ideally you give some comments like, “Oh,
yes, number two here is spam,” or “Number four here was
really, really useful.” Now, the human rater doesn’t
know which side is which, which side is the old algorithm
and which side is the new test algorithm. So it’s a truly blind taste
test. And what you do is you take that back and you look at
the stuff that tends to be rated as much better with the
new algorithm or much worse with the new algorithm. Because if it’s about the same
then that doesn’t give you as much information. So you look at the outliers. And you say, “OK, do
you tend to lose navigational home pages? Or under this query set do
things get much worse?” And then you can look at the
rater comments, and you can see could they tell that things
were getting better? If things looked pretty good,
then we can send it out for what’s known as sort of
a live experiment. And that’s basically taking a
small percentage of users, and when they come to Google
you give them the new search results. And then you look and you say
OK, do people tend to click on the new search results a
little bit more often? Do they seem to like it better
according to the different ways that we try to
measure that? And if they do, then that’s
also a good sign. Now, people can get it wrong. For example, raters and just
regular users don’t always recognize spam. So you could launch some change
that got rid of a whole bunch of spam and people
might still think that that was not as good. So it’s no substitute for the
intuition and the experience that the search engine engineers
have, but we do take the evaluation and the results
of both the human raters, as well as the analysts who
evaluate those results, very, very seriously. And we want to make sure that
we’re launching a change that’s overall a big improvement
or ideally at least an improvement
for users. So as you can see here, if I
rate this left or right as better, that doesn’t change
the algorithm. Really the human raters that are
used within the evaluation group are used to say we think
this would be better or we think this would be worse. But those ratings don’t directly
affect the search engine results. So very good question, I’m
glad you asked it. I’m glad it gave me an
opportunity to talk about how we think about when you want to
launch a search change how do you tell if it’s really
an improvement? How do you tell if you’ve
missed anything? Can you evaluate it in different
languages and see whether it looks better across
all those different languages? So those are the kinds of things
that we think about. But to just dispel the
misconception that there is a group of raters and when they
rate something as bad the– if you don’t think that this
result is as useful then it starts to drop in the rankings,
that doesn’t happen. The only time that that sort of
thing happens is when we’re taking action on web
spam, and that’s a completely different group. And we’ve talked a little bit
about those, and we could cover those in a different
video. But I hope that helps. I hope that explains a little
bit about how we think about whether to launch a search
change or not, and sort of explains when human raters are
used and what they’re used for and how their expertise
helps us make Google search results better. Thanks very much.
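
Editor's note: the three evaluation stages described in the transcript (scoring a change against previously rated URLs, the blind side by side, and the live experiment) can be sketched roughly in Python as below. This is only an illustration under stated assumptions, not Google's implementation. The numeric rating scale, the example URLs, the function names, and the use of click-through rate as the live metric are all hypothetical; the transcript only says that raters judge URLs, that the side by side is blind, and that clicks are one of several signals.

```python
import random
from statistics import mean

# Hypothetical rater scores per URL (higher is better). The real scale is not
# described in the transcript.
RATINGS = {
    "a.example": 3.0,
    "b.example": 1.0,
    "c.example": 4.0,
    "spam.example": 0.0,
}


def offline_delta(old_results, new_results, ratings):
    """Mean rating of newly surfaced URLs minus mean rating of displaced URLs.

    A positive value suggests the changed results tend to be rated higher.
    Returns 0.0 when nothing rated was added or nothing rated was dropped.
    """
    added = [ratings[u] for u in new_results if u not in old_results and u in ratings]
    dropped = [ratings[u] for u in old_results if u not in new_results and u in ratings]
    if not added or not dropped:
        return 0.0
    return mean(added) - mean(dropped)


def make_side_by_side(old_results, new_results, rng):
    """Randomly place old and new results on the left or right side.

    The rater only sees "left" and "right"; "new_on_left" is kept aside for
    the analysts, which is what makes the test blind.
    """
    new_on_left = rng.random() < 0.5
    if new_on_left:
        left, right = new_results, old_results
    else:
        left, right = old_results, new_results
    return {"left": left, "right": right, "new_on_left": new_on_left}


def unblind(task, preferred_side):
    """Map a rater's "left" or "right" preference back to "new" or "old"."""
    chose_left = preferred_side == "left"
    return "new" if chose_left == task["new_on_left"] else "old"


def click_through_rate(clicks, impressions):
    """Fraction of impressions that led to a click (one assumed live metric)."""
    return clicks / impressions if impressions else 0.0


old = ["a.example", "spam.example", "b.example"]
new = ["a.example", "c.example", "b.example"]

# c.example (rated 4.0) replaced spam.example (rated 0.0).
print(offline_delta(old, new, RATINGS))  # 4.0

task = make_side_by_side(old, new, random.Random(0))
print(unblind(task, "left"))  # "new" or "old", depending on the random layout
```

As in the transcript, none of these functions feeds a rating back into the ranking itself. They only help decide whether a proposed change looks like an improvement, and a positive number is a good sign rather than proof, since raters and users can miss spam.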

Danny Hutson

2 thoughts on “How does Google use human raters in web search?”

  1. Why don't they just let everyone rate? I mean that would work better versus letting a few hundreds do the rating. Plus the human raters might not be as knowledgeable about the topic so they won't be able to decide which one's better compared to the other.
