Getting a 'feel' for LLMs

There is a lot of noise out there about AGI, multitudes of definitions of “intelligence”, predictions of upcoming doom as well as promises of a bright future. This makes it difficult to build an internal understanding of what an LLM is (and is not), or to form a ‘mental picture’ of how it operates.

In this article we attempt to address exactly that - to understand how LLMs see the world, from the perspective of a non-technical reader who has heard bits and pieces.


Humans vs LLMs in cognitive processes

LLMs are, in a sense, modelled after humans: each trained model, when left "untouched", exists in a state of "perfect creative chaos" - just as anxious humans sitting quietly by themselves notice random thoughts, images and memories popping up, so an LLM left in quiet has “thoughts” appearing at random.

In their raw form, all LLMs simply type out text in its most probable form. What counts as most probable depends on the context: when humans are having a conversation about movies, they are unlikely to drift into a very specific agricultural topic such as garden fertiliser for tomatoes.

An LLM predicts the next most probable word in the context of everything that has been said so far.
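
To make this concrete, here is a deliberately tiny Python sketch. The probability table is invented purely for illustration - a real LLM learns such probabilities (over a vocabulary of tens of thousands of word pieces) from its training data rather than from a hand-written dictionary.

```python
import random

# Toy illustration, not a real model: hand-written probabilities stand in for
# what an actual LLM learns from its training data. The point is only that the
# same "predict the next word" step gives different answers in different contexts.
next_word_probs = {
    "we watched a great": {"movie": 0.55, "film": 0.35, "fertiliser": 0.001},
    "the best fertiliser for": {"tomatoes": 0.60, "roses": 0.25, "movie": 0.001},
}

def predict_next_word(context: str) -> str:
    # Sample one continuation, weighted by how probable each word is
    # given the context seen so far.
    probs = next_word_probs[context]
    words = list(probs.keys())
    weights = list(probs.values())
    return random.choices(words, weights=weights, k=1)[0]

print(predict_next_word("we watched a great"))       # usually "movie" or "film"
print(predict_next_word("the best fertiliser for"))  # usually "tomatoes"
```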


Generally speaking, the larger the number of parameters of an LLM, the greater the depth of its "thought" and its ability to see cause-and-effect chains. With more depth it is better "tuned" to follow the laws of physics and to pick up psychological and technological connotations.

For example, if a “deep” model is asked to write a realistic story, it is unlikely to write that a human character’s single step was 4 m long - half a metre is more likely. It is not that the model is necessarily able to “visualise” a human and connect the dots between anatomy and the situation in the story - it is simply that textual descriptions of humans on Earth taking 4-metre steps are unlikely to appear in the model’s experience (i.e. its training dataset).

By contrast, a "shallow" model doesn't really grasp human step lengths, or humans at all - it simply views text in terms of grammar: which words are likely to follow which within a sentence or a paragraph.


And yet, LLMs are unlike humans. When we learn language as kids, we point a finger at an object (let's say an apple) and hear our parents say "apple". We learn that this "sound sequence" corresponds to the 3D visual object (plus "taste" information, plus "touch" information) - that it is the "apple" sound.

In the human world, words are pointers to information which we experience internally. You have eaten an apple before, I have eaten an apple before - we have two separate subjective experiences, in which the cells of our bodies built a bio-chemical map of the apple's compounds and stored it in our brains. Stored not as "words" but as raw, sensory-experience information. When I say "apple taste", I refer to that information (a bio-chemical map of taste, subjectively experienced and stored as memory); when you hear the words “apple taste”, it is a 'database query' to retrieve your own subjective bio-chemical map of taste from memory.

In an imaginary scenario of buying low-quality lemonade, I might say 'it tastes like apple' and you might say 'no, it is more like orange', because our subjective bio-chemical maps are different. We are able to communicate this way because both sides can relate words to physical reality.


LLMs can't do that. LLMs don't have data about physical reality. They only have data about how humans speak about physical reality.

If two humans have differently worded opinions about reality, reality itself stays the same.

If those two humans need to find a common understanding, they meet in reality through their senses - and reformulate the wording. Like:

- A: this table corner is sharp.

- B: no, why, it is rounded with plastic.

- A: I have just scratched myself with it somehow (reality reference point)

- B: Show me. Ah, crap. Sorry to hear that - probably the plastic moved. (reality-based sync)

LLMs can't cross-reference reality; they can only cross-reference worded opinions.


LLMs are like perfect stereotypical office workers - they don't know the real thing, but they are very good at formulating the wording in such a way that it sounds solid.

LLMs are like stereotypical bad politicians, aiming to please the listener, to make the listener satisfied - which is far from addressing the real problems; a problem solved in reality is displaced by yet another satisfied client.

LLMs are like a stereotypical "bad coder" or "bad student" - able to drop the names of various technologies and sound confident, but not really able to produce a comprehensive, clear, working solution when it comes to it.

LLMs are, in a way, debaters who navigate word formulations in order to satisfy a given goal.


It is worth mentioning that human neurones can be connected in a circle - meaning that we have the ability to imagine a circular process (a cycle, a loop, a wheel, a spiral). In other words, we can take an idea and follow it through multiple iterations - mathematical induction is, after all, a method we all intuitively use to validate the consequences of our choices (if I buy a monthly subscription, a year from now the situation will be …).

LLM architectures do not allow them to imagine loops - and there are mathematical reasons why this is likely to stay the case. In other words, LLMs can navigate condition trees (if/else choices), but they cannot fully follow a circular situation. They can, of course, write loops in code - based on the memorised knowledge that “if I use this piece of code with a loop, I get that result” - however this is not in-depth understanding or imagination, comparable to the human capacity to close one's eyes and imagine oneself running circle after circle.

In practice this means that LLMs are suitable for problems scoped within their respective width and depth, and that this scope shouldn't include circular, iterative problems - the model will just shortcut them, as it cannot understand them.


Context-Building as a way to direct LLMs

Imagine that I now spoke of Coca Cola. Even though I wrote only a couple of keywords, doesn't a picture of a Coca Cola bottle from some billboard appear in your head - maybe even a faint sense of taste or smell crossing your mind, or small reactions like a change in heart rate?

Physically, I haven't sent any images or soundtracks - yet associations carefully built by decades of never-stopping advertising campaigns appear in your head. So this written text "coexists" with the associations it has triggered in your head - you are now more likely to feel thirsty, or to feel like you need some quick, easy sugar.

This is more or less how many LLM agents work, in particular those sitting on top of search engines. The "search results" text is pre-loaded into the LLM, and it is that `Coca Cola` which makes the model think in a certain direction: instead of the raw 'ambiguous creative chaos' we mentioned earlier, it is now directed by the associations from the search results, as they are ‘more probable’ to be the answer.

If the results are good and consistent, LLMs are able to answer the search inquiry well. When that is not the case - the results are mixed, or the LLM is not well calibrated - the model will try to make sense of whatever it sees, which can lead to hallucinations.

Notably, the message the actual 'LLM' receives is not exactly what the user wrote - it is constructed carefully and tested for best performance by ML engineers. Chat interfaces exist for human eyes, but under the hood there are text templates into which user messages are embedded. Moreover, the whole process is usually multi-staged: one LLM pass clarifies the user's inquiry given the known context (user interests, role, time of writing, etc.) - and only then is an answer attempted.
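
As an illustration, here is a minimal sketch of how such a pipeline might be wired together. The template wording, the `call_llm` and `search` callables and the staged flow are assumptions made for the example, not any specific vendor's real implementation.

```python
# A minimal sketch of a search-backed assistant pipeline. `call_llm` and
# `search` are placeholders for whatever model API and search backend are
# actually used; the template text is illustrative, not a production prompt.

SEARCH_TEMPLATE = """You are a helpful assistant.
Today's date: {date}

Search results that may be relevant:
{search_results}

Using only the information above, answer the user's question:
{question}
"""

def answer_with_search(question: str, search, call_llm, date: str) -> str:
    # Stage 1: ask the model to clarify the inquiry before searching.
    query = call_llm(f"Rewrite as a short, unambiguous search query: {question}")
    # Stage 2: fetch results and embed everything into the fixed template.
    results = search(query)
    prompt = SEARCH_TEMPLATE.format(
        date=date,
        search_results="\n".join(f"- {r}" for r in results),
        question=question,
    )
    # Stage 3: only now is the model asked to actually answer.
    return call_llm(prompt)
```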


LLM Hallucinations, Cognitive Depth and Width

Hopefully by now the mental model is complete enough to give a gut sense of what the famous “hallucinations” are about.

First of all, an LLM lives in a haze of words, without any non-textual understanding of factual physical reality. In its debater worldview there are opinions which are more or less likely to please the given customer; and of course, all of those opinions are debatable - including what we consider "ethics".

There are many subtleties which change the meaning of a word depending on the context (who speaks, about what, when and how). The human bias is to assume that my meaning is the one everybody uses - and it is very difficult to see through this.

A mouse and a baker asking "What is a bakery?" are looking for very different answers. Yet LLMs often don't know who is asking - so they answer with whatever they assume is most likely.


Cognitive Depth and Cognitive Width should be familiar to all of us from daily living.

Let us talk about the latter first.

At the workplace we talk about a drop in performance due to context switching or excessive multitasking. In this article we refer to that type of problem as cognitive width.

Excessive multitasking typically implies something like: ”my brain can hold only so many details. If I’m made to work on problems X and Y, which together have more details than I can hold, of course my performance will drop”.

As humans we experience it when "too many things are happening at once" - a friend reaching out with an emergency, family needing something from us, a bill deadline, the boss asking for an additional task because Sarah is on sick leave - not to mention the latest news on TV.

What happens is that if we are not able to "re-center", focus, re-plan and execute things one by one as a sequence, we do everything but lose "accuracy": a routine dish somehow comes out undercooked, the family is upset, the deadline is missed, an email is sent to the wrong address.

Unlike LLMs, we as humans intuitively offload the less important details while staying attuned to reality - the dinner might be a bit burnt, but it is still food, and it is sufficient.

LLMs don't have this uncanny ability, and if they are overloaded they simply lose it. Because what counts as a “more important detail” is itself a point of debate, the model often fails to follow the attention priorities requested by the user. So the user receives what feels like ‘randomly chosen’ attention points compared to what was requested: it was very important to keep the right format - but it is missing; it was important to link the A and B syntheses - but there are just separate conclusions for each; and so on.


Cognitive depth, as the name suggests, relates to focused attention spanning multiple layers simultaneously. In common speech we differentiate the “depth of knowledge” of a primary schooler and a high schooler. We differentiate degrees of “emotional intelligence” - the degree of accuracy and detail in empathetic perception.

In a similar fashion, LLMs have a limit (influenced by their size and other parameters) on how deeply they can connect details. When the task requires more than that, the model loses the thread and gives a “surface” answer, which sounds to a knowledgeable person like a hallucination.

It is not unlike teachers asking their students questions to get a sense of how much they have understood - are those “memorised” answers, or have they “grasped the essence”?

Clever words are part of the problem: students (and LLMs) can use a “pick the most probable” approach to answer questions, and because of how the sentences are constructed and where certain words are used, the answer might look correct at first glance. But a follow-up question triggering a “disconnected” response exposes the bluff.


In many senses these hallucinations are not unlike the ones we encounter in the human world - when people claim to know something they actually don’t. In other words, there is a degree of living in an imagination of knowing which is not real - which is hallucination.

The primary way of dealing with LLM hallucinations is to scale the problem down to the particular model's cognitive width and depth capacity, providing the necessary context in a very clean and easy-to-follow fashion.

Providing sufficient context is necessary to steer the model away from the literal internet dump of thoughts it inherited through training, towards the right direction. The context gives the right “debate” angle of reasoning, which helps the LLM decide what is more likely, among the many, many possibilities, for a given user.
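
As a small, purely hypothetical illustration (echoing the baker example above, with invented wording), compare an unscoped question with one that carries its own clean, narrow context:

```python
# Hypothetical prompts, for illustration only. The scoped version names the
# audience, narrows the task and pins the answer to supplied material, leaving
# the model far fewer "debate angles" to wander into.

vague_prompt = "Tell me about bakery."

scoped_prompt = """You are answering a professional baker who is comparing
wholesale flour suppliers.

Task: in 3 bullet points, explain what "bakery" refers to in the supplier
contract excerpt below. Use only the excerpt; if something is not stated
there, say so.

Contract excerpt:
{excerpt}
"""
```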


Conclusion

I hope the above analogies helped build a basic intuition of what LLMs can and cannot do, what their fundamental constraints are (like the inability to truly imagine loops), and also what they are great at.
To use LLMs efficiently (be it as code assistants or chat-bots), users simply need to remember that irrelevant details spread the LLM's attention and reduce accuracy.
They should also remember that LLMs have a certain mirror effect: a well-formulated, detailed, respectful question gives a well-formulated, accurate answer; a lazy, badly formulated question gives a similarly lazy response. Speculatively, the model deduces the personality of the user from the words and the very shape of the sentences - and it answers in a related vibe (as that is what is more likely to appear in the training data).
It is almost as if, seeing the LLM as condensed human knowledge, the very language we use becomes a query to a particular layer of that knowledge.

Wishing you all the best on your LLM journey!
