I have a theory about this. All these LLMs are trained on mostly written texts. ...

h-jones · on April 11, 2023

If you’re looking for research along these lines, Melanie Mitchell at the Santa Fe Institute explores these areas. There are better references from her, but this is what came to mind: https://medium.com/p/can-a-computer-ever-learn-to-talk-cf47d....

famouswaffles · on April 11, 2023

LLMs can simulate inner voices pretty well. The way they've handled memory here isn't actually necessary, and there are a number of agentic GPT papers out that show that (Reflexion, Self-Refine, etc.). I can see why they did it though (it helps a lot for control/observation).

crooked-v · on April 11, 2023

> The way they've handled memory here isn't actually necessary

I'm curious if there are other methods you can point at that would handle arbitrarily long sets of 'memories' in an effective way. The use of embeddings and vector searches here seems like a way to sidestep that problem, one that's both powerful and easy to understand, and easy to generalize into multi-level referencing if there's enough space in the context window.
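For anyone who hasn't seen the pattern, here is a minimal sketch of embedding-based memory retrieval. It is not the paper's implementation: `embed` is a toy bag-of-words stand-in for a learned embedding model, and `retrieve` and the sample memories are made up for illustration. Only the top-k matches would get pasted into the context window.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(memories, query, k=2):
    """Return the k stored memories most similar to the query."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(embed(m), q), reverse=True)
    return ranked[:k]


# Hypothetical memory store; a real one could hold far more entries
# than would fit in a prompt.
memories = [
    "talked with Sam about the election",
    "bought bread at the bakery",
    "Sam is planning a party on Friday",
]

print(retrieve(memories, "what is Sam planning", k=1))
```

The store can grow arbitrarily because only the retrieved entries cost context tokens; a real system would swap `embed` for a learned model and the linear scan for a vector index.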

famouswaffles · on April 11, 2023

Every method so far basically uses embeddings and vector searches. What I mean is that how the LLM processes/uses that information doesn't need to be this hand-holdy.

frozenlettuce · on April 11, 2023

I guess we could hook those AIs into a first-person GTA 5 and see what happens. Every second: take a screenshot, feed it into facebookresearch/segment-anything, describe the scene to ChatGPT, receive its input, repeat.
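The loop described above could be sketched roughly like this. All four callables (`capture`, `describe`, `decide`, `act`) are hypothetical placeholders: in a real setup they would wrap a screen grabber, a segmentation/captioning step, a chat-model call, and a game input driver, none of which are shown here.

```python
def run_agent_loop(capture, describe, decide, act, steps):
    """Run a fixed number of perceive -> describe -> decide -> act cycles.

    capture():                returns a frame (e.g. a screenshot)
    describe(frame):          returns a text description of the frame
    decide(scene, history):   returns the next action, given past steps
    act(action):              sends the action to the game
    """
    history = []
    for _ in range(steps):
        frame = capture()
        scene = describe(frame)
        action = decide(scene, history)
        act(action)
        history.append((scene, action))
    return history


# Stubbed usage, just to show the data flow.
frames = iter(["frame-1", "frame-2"])
log = run_agent_loop(
    capture=lambda: next(frames),
    describe=lambda frame: "scene from " + frame,
    decide=lambda scene, history: "forward" if not history else "brake",
    act=lambda action: None,
    steps=2,
)
print(log)
```

A one-second cadence would just mean sleeping between iterations; the slow part in practice would presumably be the model calls, not the loop.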

barking_biscuit · on April 11, 2023

Someone needs to start a Twitch account or YouTube channel focused around getting AI to play games like this through things like AutoGPT and Jarvis and just see what the hell it gets up to, what the failure modes are, and if it can succeed etc.

abrichr · on April 11, 2023

This is known as "embodied cognition". Current approaches involve collecting data that an agent (e.g. humanoid robot) experiences (e.g. video, audio, joint positions/accelerations), and/or generating such data in simulation.

See e.g. https://sanctuary.ai

goldenkey · on April 11, 2023

It's already multimodal, as entropy is... entropy. In sound, vision, touch and more, the essence of universal symmetry and laws gets through, such that the AI can generalize across information patterns, not specifically text. Think of it as input instead.

Try prompts like: https://news.ycombinator.com/item?id=35510705

Encode sounds, images, etc. in low resolution, and the LLM will be able to describe directions, points in time in the song, etc.

These LLMs can spit out an ASCII image of text, or a different language, or code, etc. They understand the difference between a representation and an object.
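As a rough illustration of the "encode in low resolution" idea, here is a sketch that downsamples a grayscale grid into ASCII that could be pasted into a prompt. The function name, the character ramp, and the block size are all assumptions for the example, and whether a given model reasons well over such input is the claim above, not something this code demonstrates.

```python
RAMP = " .:-=+*#%@"  # dark -> bright; an arbitrary choice of ramp


def to_ascii(pixels, block=2):
    """Average each block x block tile of a 0-255 grayscale grid
    and map the mean brightness to one character."""
    rows = []
    for y in range(0, len(pixels) - block + 1, block):
        row = []
        for x in range(0, len(pixels[0]) - block + 1, block):
            tile = [pixels[y + dy][x + dx]
                    for dy in range(block) for dx in range(block)]
            mean = sum(tile) / len(tile)
            index = min(int(mean / 256 * len(RAMP)), len(RAMP) - 1)
            row.append(RAMP[index])
        rows.append("".join(row))
    return "\n".join(rows)


# 4x4 test image: left half black, right half white.
image = [[0, 0, 255, 255] for _ in range(4)]
print(to_ascii(image))
```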