

LLMs generate the next most probable token given the preceding context of tokens (not an average of the entire internet). And post-training shifts the odds a bit further in a relatively useful direction. So given the right context, the LLM will fairly consistently regurgitate content stolen from PhDs and academic papers, maybe even managing to shuffle it around in a novel way that is marginally useful.
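To make that concrete, here's a toy sketch of the sampling step (the vocab and logits are invented; a real model has ~100k tokens and a learned network producing the scores, but the mechanism is the same):

```python
import numpy as np

# Toy vocabulary and made-up scores ("logits"); a real model learns
# these from data, but the generation step itself is just this.
vocab = ["the", "photon", "glue", "lol"]

def sample_next(logits, temperature=1.0, seed=0):
    # Softmax turns raw scores into a probability distribution over
    # the vocabulary; lower temperature sharpens it toward the top token.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(vocab, p=probs)

# Training only moves these numbers around; there's never a point
# where the "wrong" tokens hit exactly zero probability.
print(sample_next([2.0, 1.0, -1.0, 0.5]))
```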
Of course, that is only the general trend given the right™ prompt. Even with a prompt that looks mostly right, one seemingly innocuous word in the wrong place might nudge the odds and you get the answer of a moron from /r/hypotheticalphysics in response to a physics question. Or asking for a recipe gets you Elmer's glue on your mozzarella pizza, courtesy of a Reddit joke answer.
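Same kind of toy sketch for the one-word-nudge problem (all odds invented for illustration): the prompt is the context the distribution is conditioned on, so swapping a single word selects a different distribution and the likeliest answer can flip.

```python
# Invented conditional odds, just to show the mechanism: one extra
# word means a different conditioning context, hence different odds.
next_token_odds = {
    ("how", "do", "magnets", "work"):           {"fields": 0.7, "vibes": 0.3},
    ("how", "do", "magnets", "really", "work"): {"fields": 0.4, "vibes": 0.6},
}
for context, odds in next_token_odds.items():
    print(" ".join(context), "->", max(odds, key=odds.get))
```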
If they took the time and energy to curate the junk out the way they'd need to in order to correct that, they wouldn't be left with a large enough sample to actually scale off of.
They do take steps like training the model generally on the desired languages, random internet bullshit and all, and then fine-tuning it on the actually curated stuff. So that shifts the odds, but again, not enough to actually guarantee anything.
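In the same toy terms (made-up counts, not anyone's actual training recipe), fine-tuning acts like a pile of extra weighted evidence stacked on top of the pretraining counts, which improves the odds without ever zeroing out the junk:

```python
import numpy as np

# Made-up counts: a big messy pretraining corpus plus a small curated
# fine-tuning set that gets extra weight. The junk gets less likely,
# but its probability never reaches zero.
vocab = ["sane_answer", "reddit_joke"]
pretrain_counts = np.array([60.0, 40.0])   # raw internet mix
curated_counts  = np.array([9.0, 1.0])     # small, clean, expensive

blended = pretrain_counts + 20 * curated_counts  # fine-tune as upweighting
probs = blended / blended.sum()
print(dict(zip(vocab, probs.round(2))))  # {'sane_answer': 0.8, 'reddit_joke': 0.2}
```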
So tl;dr: you're right, but since it is possible to get somewhat better than average internet junk with curation, post-training, and prompting, LLM boosters and labs have convinced themselves they are just a few more iterations of data curation, training approaches, and prompting techniques away from entirely eliminating the problem, when the best they can do is make it less likely.



side-note… I wonder what the overlap is between the rationalists that showed up to their stupid "march for billionaires" and AI doomers?