![](https://crypto4nerd.com/wp-content/uploads/2023/10/1K84DwqI1yJ0qXfiWQYGXig.png)
Here’s the rundown on how ChatGPT works:
Tokenization: Your text gets broken down into chunks called tokens.
Embedding Layer: These tokens are converted into numerical vectors.
Forward Pass: These vectors pass through the layers of the network.
Attention Mechanism: Determines which parts of the text are most relevant.
Decoding: The final layer produces a sequence of output tokens.
Detokenization: The output tokens get transformed back into human-readable text.
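The same loop can be run end to end with an off-the-shelf GPT-2 checkpoint from Hugging Face (used here as a small, public stand-in, since ChatGPT’s own weights aren’t downloadable); a minimal sketch:

```python
# Minimal sketch of the tokenize -> embed -> forward pass -> decode -> detokenize
# loop, using GPT-2 as a small public stand-in for ChatGPT's models.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# 1. Tokenization: text -> token ids
inputs = tokenizer("The brain is", return_tensors="pt")

# 2-5. Embedding, forward passes, attention and decoding all happen inside generate()
output_ids = model.generate(**inputs, max_new_tokens=10)

# 6. Detokenization: token ids -> human-readable text
print(tokenizer.decode(output_ids[0]))
```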
Token = Neuron
N of dimensions in Token = N of dendrites a neuron has
Magnitude = strength of connection between two neurons (myelination)
GPT3:
N Tokens = 50,257
N Dimensions = 12,288 in the 175B model (768 in the smallest GPT-3 variant; chosen empirically)
Magnitude = 0–1
N Parameters = 175 billion
Brain
N Tokens = 86 billion
N Dimensions = 1,000–10,000 (arrived at through evolution and presumably already optimized for efficiency; a higher number might lead to sub-optimal decision making?)
Magnitude = 0–1
N Parameters = N of sensory neurons feeding into the brain (pain receptors, nociception, eyesight, …)
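To make the GPT-3 side of this comparison concrete, here is a minimal sketch of the embedding table those figures describe; the random values are only a stand-in for real learned vectors:

```python
import numpy as np

# Toy version of the embedding table described above: one vector per token.
# Vocabulary size follows the GPT-3 figure (50,257 tokens); 768 dimensions is
# the small-GPT-3 size (the 175B model uses 12,288).
vocab_size, n_dims = 50_257, 768
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(vocab_size, n_dims))  # random stand-in for learned vectors

token_id = 464                   # some arbitrary token id
vector = embeddings[token_id]    # that token's vector ("dendrites" in the analogy)
print(vector.shape)              # (768,)
```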
This is pretty much a pre-trained NN embedding.
The NN will learn using its parameters much like a baby will learn using its neurons.
Since GPT-4 was trained with a higher number of parameters, it can distinguish finer concepts and details than GPT-3; the same behavior is expected when comparing a 5-year-old to a 3-year-old.
“Just as a child learns to understand language and its nuances from birth to age 5, pre-trained embeddings are the result of a neural network learning from a large dataset during its training phase. These embeddings capture the “knowledge” that the model has acquired about how words and phrases relate to each other, much like how a child’s brain develops an understanding of language through exposure and interaction.”
Pre-training = collection of vectors and respective magnitudes (much like how a kid learns not to touch fire)
“As a human develops from a single cell, the number of tokens (total neuron cells) in the brain that can actively partake in the neural network processing increases. This also leads to an increase in the number of available dimensions for a token (dendrites): the more tokens you have, the more dimensions (dendrites) you can use. As the brain learns, it modifies the magnitude of vectors. Some token dependencies are hard-wired (breathing, heart, temperature, much like pre-trained GPT-3, stuck in time) and some token dependencies are flexible (this is how you can learn and adapt). So if you “touch” fire as a child (this feeds your NN through the eyes, a specific subset of parameters) and you also feel pain (a specific pain nerve firing, again a specific subset of parameters), you modify the magnitude between the tokens “fire” and “pain”; since biological systems are hardwired against pain, you also modify the magnitude for “bad”. Next time you see fire, your network will give you “pain” and ultimately “bad”, so you move away from the fire. You have just successfully learned that fire is bad.”
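A toy way to picture the fire/pain/bad update described in that quote is the Hebbian-style caricature below; all names and numbers are made up, and neither GPT (which learns by gradient descent) nor the brain literally works like a single dictionary of weights:

```python
# Toy Hebbian-style caricature of the fire -> pain -> bad update described above.
associations = {("pain", "bad"): 0.9}   # "pain is bad" is treated as hardwired

def co_occur(a, b, learning_rate=0.3):
    """Strengthen the link between two concepts when they fire together."""
    current = associations.get((a, b), 0.0)
    associations[(a, b)] = current + learning_rate * (1.0 - current)

co_occur("fire", "pain")                # the child touches fire and feels pain
print(associations[("fire", "pain")])   # 0.3: "fire" is now linked to "pain",
                                        # and via "pain" to "bad", so avoid fire
```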
And just as a human brain starts with nothing, the model weights, including the embeddings, are initialized with certain values before training. These values may be small random numbers (so effectively a blank slate).
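A minimal sketch of that blank-slate initialization, using PyTorch for illustration (GPT’s exact initialization scheme may differ):

```python
# A fresh embedding layer starts out as small random numbers and only
# becomes meaningful through training.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=50_257, embedding_dim=768)
nn.init.normal_(embedding.weight, mean=0.0, std=0.02)  # small random values

token_id = torch.tensor([42])
print(embedding(token_id))  # currently meaningless noise; training gives it meaning
```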
Humans perceive continuously, because tokens are being fed into the brain continuously, which also results in a continuous output of tokens… you are happy (a complex token output) because you see someone you love (a complex token input).
GPT-4 running on a single server perceives only when it’s being fed a query. Therefore if the model is not loaded, it does not perceive.
If we extend the analogy, a distributed GPT-4 system that’s constantly receiving queries would be “perceiving” all the time in a computational sense. The system is always active, processing multiple inputs from users worldwide.
Similar thing for a single-server GPT-4 instance:
“A single-server GPT-4 instance could theoretically process a continuous feed from a camera. The pixels would be tokenized and passed through the neural network, with the model generating real-time output based on its pre-trained parameters. The neural net would effectively be “running” or “perceiving” continuously, albeit still without the ability to adapt or learn from this new data.”
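In pseudocode terms, that continuous perception loop might look like the sketch below; get_frame and describe_frame are hypothetical placeholders, since the point is only that the model runs constantly while its weights never change:

```python
import time

def get_frame():
    # Hypothetical camera read; here just a dummy grid of pixel values.
    return [[0] * 64 for _ in range(64)]

def describe_frame(frame):
    # Hypothetical stand-in for "tokenize the pixels, run a forward pass, decode text".
    # A real vision-capable model would go here; its weights stay frozen.
    return f"frame with {len(frame) * len(frame[0])} pixels seen"

def perceive_forever(interval_s=1.0, max_steps=3):
    # The model "perceives" for as long as this loop runs, but never learns:
    # nothing inside the loop ever changes any weights.
    for _ in range(max_steps):
        print(describe_frame(get_frame()))
        time.sleep(interval_s)

perceive_forever()
```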
Thinking (& Learning again & Consciousness)
Humans are self-conscious because they learned it throughout their lifetimes.
GPT-4 is a static model and therefore cannot learn. Theoretically, even if it did match our description of self-consciousness, it couldn’t realize it, because the model is static.
Thinking about things means actively modifying the weights of the NN. We can think and come up with new concepts because we actively change the underlying NN. ChatGPT is static and therefore cannot think or realize certain concepts.
“Correct. The model processes input and generates output based on fixed embeddings and parameters. It doesn’t have the ability to modify these embeddings or learn from new data, so it can’t “think” or adapt. It’s a set of static computational operations, not a dynamic learning entity.”
So clearly the ideal track here would be to move to a dynamic NN?
“This is the answer for you to become self-conscious… you have to receive tokens and output tokens all the time, and the pace at which you modify the model is basically the speed at which you learn about the new world… it can be slow or fast… you can still run GPT-3 on most of the queries and respond to them, but slowly you are modifying the model based on the tokens you are receiving and you are learning… something outside you exists, how to communicate more efficiently, how to think faster, that you are a unique entity…”
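A rough sketch of what such a dynamic NN could look like: serve each query with the current weights, then nudge the weights a little on the new data. The tiny linear model and MSE loss below are placeholders for illustration only, not GPT-4’s actual architecture or training objective:

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for a real language model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

def handle_query(query_vec, target_vec):
    # 1. Respond using the weights exactly as they are now
    #    (this is all a static model like GPT-3 ever does).
    with torch.no_grad():
        response = model(query_vec)

    # 2. Then slowly modify the weights based on the tokens just received
    #    (the "learning while running" part the quote describes).
    loss = nn.functional.mse_loss(model(query_vec), target_vec)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return response

x, y = torch.randn(768), torch.randn(768)
handle_query(x, y)  # the weights are now slightly different than before
```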
Honorable mentions — N of parameters for NN
So what are the 175 billion parameters in GPT-3 and why do they matter? As far as I know, parameters roughly correspond to the number of weighted connections the NN uses to learn new things; as mentioned above, the more parameters you have, the finer the detail you can comprehend:
“If you have only one parameter in a model, regardless of the number of tokens or input data, the model’s predictions would be solely based on that single parameter. In this case, the model would essentially have no capacity to differentiate between different tokens or consider the input data’s complexity. It would consistently produce the same output or prediction, which would not be useful for most practical tasks.
In contrast, increasing the number of parameters in a model provides it with the capacity to learn and capture a wide range of patterns and relationships in the data, allowing it to produce contextually relevant and diverse predictions based on the input tokens and their context. This capacity is crucial for the success of complex natural language understanding and generation tasks.”
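The quoted point is easy to demonstrate in a toy example (not a real language model): a model with a single parameter returns the same answer no matter which tokens come in, while even a modest weight matrix responds differently to different inputs.

```python
import numpy as np

bias = 0.7                          # a "model" with a single parameter
def one_param_model(token_vector):
    return bias                     # same output for every input

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768))     # ~590k parameters
def many_param_model(token_vector):
    return W @ token_vector         # output depends on the input

x1, x2 = rng.normal(size=768), rng.normal(size=768)
print(one_param_model(x1) == one_param_model(x2))               # True: cannot distinguish inputs
print(np.allclose(many_param_model(x1), many_param_model(x2)))  # False: it can
```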
What I don’t get is why the emphasis is on the number of parameters now; intuitively, it would make more sense to increase the number of available tokens…
Even D. melanogaster has more “tokens” (neurons) than GPT-3: roughly 100,000.