Ognjen Regoje

Why are we expected to train Google's NN for free?

#captcha #ethics #product #spam

A user on Hacker News posed a question:

I keep seeing the captchas more and more whose obvious purpose is to train their self-driving NN, and they’re getting out of hand. Sometimes I have to work through 5+ images. How is that legal that they can just interrupt me on a whim and ask for me to do work for them? Is there an alternative solution available?

After you remove the hyperbole, the question that remains is: “Why are we expected to train Google’s NN?”

As with most questions, there is a short answer and a long one.

The short answer

You aren’t. Don’t use the service requiring the captcha.

But short answers lack nuance and aren’t very useful.

The longer answer

The long answer is the same as the answer to a different question: “Why is reCaptcha so prevalent?”

And if my experience is any indication, the answer to that comes down to three things:

  1. It’s necessary, or you end up with spam accounts
  2. The users are very familiar with it
  3. It works well

Spam

Spam accounts are real. Before implementing reCaptcha, we had more spam accounts than real ones.

Users are familiar with it

Because it’s so common, regular users are very familiar with it. Therefore, they know how to solve it and, very importantly, they don’t complain that it’s there.

Works well

It’s simple to implement.

It’s stable.

It doesn’t get broken.

I’ve also not seen it defeated yet.

And it’s free.
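
To give a sense of what "simple to implement" means on the server side, here is a minimal sketch in Python. It assumes the standard reCaptcha flow, where the browser widget produces a token and the server posts it to the `siteverify` endpoint together with the site's secret key. The function names (`build_verify_request`, `is_human`, `verify_captcha`) are made up for this example, and a real integration would likely also check the hostname and log the error codes.

```python
import json
import urllib.parse
import urllib.request

# reCaptcha's token verification endpoint.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret, token, remote_ip=None):
    """Build the POST request that asks whether a token is valid."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional: the end user's IP address
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(VERIFY_URL, data=data, method="POST")


def is_human(response_body):
    """Interpret the JSON reply; anything other than success=true is rejected."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("success") is True


def verify_captcha(secret, token, remote_ip=None, timeout=5):
    """Send the token for verification. Network errors count as a failure."""
    if not token:
        return False  # the form was submitted without solving the captcha
    request = build_verify_request(secret, token, remote_ip)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as reply:
            return is_human(reply.read().decode("utf-8"))
    except OSError:
        return False
```

Treating network errors as a failed check is a design choice in this sketch: it keeps spam out during an outage at the cost of blocking real users, and some sites may prefer the opposite.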

I’ve read anecdotes of situations where people had to go through ten challenges. My equally anecdotal evidence is that I’ve never done more than two, and I’d guess my average is less than one.

So, that is why many websites use it. Objectively, besides the potential privacy issues that most regular users don’t care about, I can’t find much wrong with it. And that is why you, as the user, encounter it often.

But why do they get to use that data?

reCaptcha is a good service provided for free. So, instead of payment, they get to use the data generated by it.

I think that’s fair.

I also think people forget that reCaptcha was an independent company that got acquired by Google. So, it’s not part of Google’s evil plan. Someone non-evil came up with it.

What are the alternatives?

FunCaptcha

GitHub is the most prominent site that uses it. It’s pretty good.

It seems that Arkose Labs provides it. However, I couldn’t find an obvious way to sign up for just FunCaptcha and try implementing it, so it might be part of some bigger offering.

If more sites used it, you’d not have to train anyone’s NN for free.

hCaptcha

hCaptcha is the non-Google version that works in the same way and does the same thing. As a user, you’d still train a company’s NN for free.

With that said, an advantage hCaptcha might have is that it’s not Google, so some audiences might find it appealing.

Interestingly, Cloudflare made noise about switching to hCaptcha but seems to have quietly rolled it back, at least for a portion of their traffic. I wonder why.

Generating your own

There are also local libraries that generate images such as this:

[Image: simple captcha]

I’d guess that for the vast majority of sites, they are fine. But if I’m not mistaken, many have been defeated, so they don’t offer very good protection.

For you as a user, however, they mean that you will always have to enter something. And sometimes, you will have to do it multiple times because the generated image is too obfuscated to read. With reCaptcha, on the other hand, sometimes you don’t have to do anything. Sometimes you do one challenge. And very rarely, you do more than one.

Phone number, photo ID, or credit card info

From an implementation perspective, these alternatives are a lot more complex.

From a user’s perspective, I would not agree to any of them.

If I had to choose an alternative, I’d choose hCaptcha.

As a user, the way I see it, some anti-spam measures are here to stay for the foreseeable future. And I’d rather the measure did something useful than have me enter numbers for numbers’ sake.

Well, what about paid sites?

Paid sites use it on sign up if you create an account before picking a plan or entering payment information.

Without it, the service ends up with thousands of spam accounts in the first step of the funnel.

It’s also frequently used in the login form so that passwords can’t be tried programmatically.
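
As a rough sketch of the login case, assuming the captcha is checked before the password is ever looked at: the handler below is hypothetical, and `verify_captcha` and `check_password` are placeholder callables standing in for whatever the site actually uses.

```python
def handle_login(username, password, captcha_token, verify_captcha, check_password):
    """Return 'captcha_failed', 'bad_credentials', or 'ok'.

    verify_captcha(token) and check_password(username, password) are
    hypothetical callables supplied by the application; each returns a bool.
    """
    # The captcha is checked first, so a script that cannot solve the
    # challenge never gets to test a password at all.
    if not verify_captcha(captcha_token):
        return "captcha_failed"
    if not check_password(username, password):
        return "bad_credentials"
    return "ok"
```

The ordering is the point of the sketch: a failed captcha short-circuits the request, so each password guess costs the attacker a solved challenge.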

So, I don’t find them unreasonable

Besides the potential privacy implications, I think the tradeoffs are not unreasonable.

I will try hCaptcha in the future. But for end-users, there won’t be any discernible difference.