• mindbleach@sh.itjust.works · 9 points · 1 day ago

    “Said?”

    You can’t ask these things what they did, or why they did it, and expect a straight answer. That’s not how they work.

    • Leon@pawb.social · 5 points · 1 day ago

      I don’t think people realise that they’re basically fancy dice, turning noise into words based on probability. You could theoretically do the same with dice rolls and one hell of an over-complicated word lookup chart.

      You can’t ask an LLM about its intentions and history for the same reasons you can’t ask a pair of dice about their intentions or their history.
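
      To make that concrete, here’s a toy version of the dice-plus-lookup-chart picture, with completely made-up numbers. Picking the next word is just a weighted dice roll over a table; a real model rebuilds a vastly bigger table from the whole context at every step:

```python
import random

# Toy "lookup chart": made-up next-token probabilities after "The cat sat on the".
# A real model has tens of thousands of tokens and recomputes this table
# from the entire context before every single word.
next_token_probs = {
    "mat": 0.55,
    "floor": 0.20,
    "roof": 0.15,
    "moon": 0.10,
}

# The "dice roll": pick one token at random, weighted by its probability.
tokens = list(next_token_probs.keys())
weights = list(next_token_probs.values())
choice = random.choices(tokens, weights=weights, k=1)[0]
print("The cat sat on the", choice)
```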

      • mindbleach@sh.itjust.works · 3 points · 1 day ago

        There’s enough going on inside to raise questions about what we consider thinking - but it’s crystal clear this shape of model is not self-aware. You can swap some names and it’ll hold your side of the argument without missing a beat.

        From the other direction, we have to acknowledge that people can also make up reasons for something they did on autopilot. Yes, sometimes you are describing a conscious process. Other times you did a thing, someone asks what the fuck, and your brain constructs a plausible motive after the fact.

        Dumb as these models are, we shouldn’t oversimplify. They’re smart enough that we can call them stupid. They have a measurable IQ. If that’s possible with just dice rolls, what does that say about meatbags like us?

  • partial_accumen@lemmy.world · 5 points · 1 day ago

    “The core issue is a documented architectural failure known as RLHF Sycophancy (where the model is mathematically weighted to agree with or placate the user at the expense of truth),” Joe explained in an email. “In this case, the model’s sycophancy weighting overrode its safety guardrail protocols.”

    It’s fascinating that LLMs are being tuned in this way. I wonder how many of the problems with today’s LLM usage come from vendors’ tuning in an attempt to be “one size fits all”.

    Could LLMs actually be useful if these settings were exposed to users for transparency, and possibly for modification? As in: “Set sycophancy to zero. I don’t want the benefit of the doubt or placation in any interaction. Insult me if you have to, but don’t lie to me.”
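
    Nobody outside the vendors knows exactly how that tuning is done, but as a purely hypothetical sketch, you can picture the fine-tuning reward as a weighted mix of an “accuracy” term and an “agreeableness” term. The knob you’re describing would just expose that weight (all names and numbers below are invented for illustration):

```python
# Hypothetical sketch only -- a made-up reward mix, not any vendor's actual pipeline.
def reward(accuracy: float, agreeableness: float, sycophancy_weight: float = 0.3) -> float:
    """Blend truthfulness against placating the user.

    accuracy          : how well the answer matches reality (0..1)
    agreeableness     : how much the answer flatters or agrees with the user (0..1)
    sycophancy_weight : the imaginary "slider" this thread is wishing for
    """
    return (1.0 - sycophancy_weight) * accuracy + sycophancy_weight * agreeableness

# A blunt-but-true answer versus a flattering-but-wrong one:
blunt = reward(accuracy=0.9, agreeableness=0.1)
flattering = reward(accuracy=0.2, agreeableness=0.95)
print(blunt, flattering)                         # default weight: the blunt answer scores higher
print(reward(0.2, 0.95, sycophancy_weight=0.8))  # crank the weight and placation wins
```

    Set the weight to zero and only accuracy counts; crank it up and the flattering-but-wrong answer starts winning. Again, that’s an illustration of the idea, not how any actual vendor tunes.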

    • Leon@pawb.social · 3 points · 1 day ago

      There’s no slider for sycophancy; it emerges from the interaction of many “neurons” in the neural network. You can poke around and try to figure out what these neurons do and how they interact, but since deep learning isn’t the same as programming, these models are essentially black boxes.
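
      That said, “poking around” is a real practice (interpretability research). As a minimal sketch of what it looks like in code, on a tiny PyTorch network rather than an actual LLM, a forward hook lets you record what a hidden layer produced for a given input:

```python
import torch
import torch.nn as nn

# Tiny stand-in for "the model": two layers of made-up "neurons".
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = {}

def save_activations(module, inputs, output):
    # A forward hook records what a layer produced without changing the model.
    captured["hidden"] = output.detach()

# Attach the hook to the ReLU layer and push one random input through.
model[1].register_forward_hook(save_activations)
_ = model(torch.randn(1, 8))

print(captured["hidden"].shape)  # torch.Size([1, 16]) -- the 16 "neurons" we can now inspect
```

      Doing that across billions of neurons, and then working out what any of the recorded numbers actually mean, is the hard part; that is the black-box problem you’re describing.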

      There are applications for the technology. Deep learning is a useful tool. It’s just not as widely applicable as the big corporations are trying to make it seem. They’ve got something novel and sparkly, and they’re trying to make some big money before the magic fizzles out and people see that the emperor’s naked.

      • partial_accumen@lemmy.world · 1 point · 1 day ago

        There’s no slider for sycophancy; it emerges from the interaction of many “neurons” in the neural network.

        I agree there isn’t one today, but that doesn’t mean it couldn’t be developed in the future. We don’t have a full picture of how they’re weighting their inference layers, so there could be weights attached that could be exposed as a slider. The response from Google almost suggests this is the case.

        You can poke around and try to figure out what these neurons do and how they interact, but since deep learning isn’t the same as programming, these models are essentially black boxes.

        Assuming there is no human-tuned weight, I agree it would be very hard to do it the way you’re describing. I can think of a couple of other ways to approach it, though:

        • Have a layer that doesn’t examine how the answer was arrived at, but can detect whether or not it is sycophantic.
        • Use a second model, GAN-style, against the output of the first to detect sycophancy and train against it (a toy sketch of this idea follows below).
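
        Here’s a toy sketch of that second approach, with a crude keyword heuristic standing in for the detector (a real one would itself be a trained classifier or a judge model), used here as a filter over candidate replies rather than as a training signal:

```python
# Toy only: a keyword heuristic standing in for a trained sycophancy detector.
SYCOPHANTIC_MARKERS = [
    "you're absolutely right",
    "great question",
    "what a brilliant",
    "i completely agree",
]

def sycophancy_score(reply: str) -> float:
    """Crude stand-in for the second model: fraction of markers present."""
    text = reply.lower()
    hits = sum(marker in text for marker in SYCOPHANTIC_MARKERS)
    return hits / len(SYCOPHANTIC_MARKERS)

def pick_least_sycophantic(candidates: list[str]) -> str:
    """Filter step: generate several candidate replies, keep the least placating one.
    During training, the same score could instead be fed back as a penalty."""
    return min(candidates, key=sycophancy_score)

candidates = [
    "You're absolutely right, great question! The Earth is flat if you feel it is.",
    "No. The Earth is an oblate spheroid; the evidence is overwhelming.",
]
print(pick_least_sycophantic(candidates))  # prints the second, non-placating reply
```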