
I have a theory about this. All these LLMs are trained mostly on written text. That's only a tiny part of our brain's output. There are other things just as important, if not more so, for learning how to think. Things that no one has ever written about: basic common sense, intuitive physics, inner voices. How do we get enough data to train on those? Or do we need a different training algo which requires less data?


If you’re looking for research along these directions, Melanie Mitchell at the Santa Fe Institute explores these areas. There are better references from her, but this is what came to mind: https://medium.com/p/can-a-computer-ever-learn-to-talk-cf47d....


LLMs can simulate inner voices pretty well. The way they've handled memory here isn't actually necessary, and there are a number of agentic GPT papers out that show it (Reflexion, Self-Refine, etc.). I can see why they did it though (it helps a lot for control/observation).


> The way they've handled memory here isn't actually necessary

I'm curious if there are other methods you can point at that would handle arbitrarily long sets of 'memories' in an effective way. The use of embeddings and vector searches here seems like a way to sidestep that problem, one that's powerful, easy to understand, and easy to generalize into multi-level referencing if there's enough space in the context window.
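For anyone who hasn't seen the embeddings-plus-vector-search idea spelled out, here is a minimal sketch. It assumes a toy bag-of-words `embed` function as a stand-in for a learned embedding model, and `MemoryStore` is a hypothetical name, not anything from the paper. Only the retrieved memories would go into the context window.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class MemoryStore:
    """Hypothetical memory store: keeps every memory, retrieves the top k."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = MemoryStore()
store.add("Alice planted tomatoes in the garden")
store.add("Bob is planning a party on Friday")
store.add("The cafe opens at eight every morning")

# Only the most relevant memory is pulled back out for the prompt.
top = store.search("when is the party", k=1)
print(top)
```

The store can grow arbitrarily large because the prompt only ever sees the `k` best matches; a real system would swap the linear scan for an approximate nearest-neighbor index.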


Every method so far basically uses embeddings and vector searches. What I mean is that how the LLM processes and uses that information doesn't need to be this hand-holdy.


I guess that we could hook those AIs into a first-person GTA 5 and see what happens. Every second, take a screenshot, feed it into facebookresearch/segment-anything, describe the scene to ChatGPT, receive input, repeat.
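A rough sketch of that loop, with every stage passed in as a plain function. All of the names here (`run_agent`, the `fake_*` stubs) are hypothetical placeholders: nothing below calls segment-anything, ChatGPT, or the game, and it assumes "receive input" means the model replies with the next game input.

```python
import time

def run_agent(capture, segment, describe, ask_model, act, steps=3, interval=1.0):
    """Run the screenshot -> segment -> describe -> model -> input loop."""
    history = []
    for _ in range(steps):
        frame = capture()           # grab the current screen
        objects = segment(frame)    # split the frame into labeled regions
        scene = describe(objects)   # turn the regions into a text description
        action = ask_model(scene)   # ask the language model what to do next
        act(action)                 # send that input to the game
        history.append((scene, action))
        time.sleep(interval)        # "every second" in the comment above
    return history

# Stub stages so the loop can run without a game, a vision model, or an LLM.
def fake_capture():
    return "frame"

def fake_segment(frame):
    return ["car", "road"]

def fake_describe(objects):
    return "I see: " + ", ".join(objects)

def fake_model(scene):
    return "brake" if "car" in scene else "forward"

pressed = []
log = run_agent(fake_capture, fake_segment, fake_describe, fake_model,
                pressed.append, steps=2, interval=0)
print(log)
```

Keeping each stage as a swappable function means the real screenshot, segmentation, and chat calls can be dropped in one at a time.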


Someone needs to start a Twitch account or YouTube channel focused around getting AI to play games like this through things like AutoGPT and Jarvis and just see what the hell it gets up to, what the failure modes are, and if it can succeed etc.


This is known as "embodied cognition". Current approaches involve collecting data that an agent (e.g. humanoid robot) experiences (e.g. video, audio, joint positions/accelerations), and/or generating such data in simulation.

See e.g. https://sanctuary.ai


It's already multimodal, as entropy is... entropy. In sound, vision, touch and more, the essence of universal symmetry and laws gets through, such that the AI can generalize across information patterns, not specifically text -- think of it as input instead.

Try prompts like: https://news.ycombinator.com/item?id=35510705

Encode sounds, images, etc in low resolution, and the LLM will be able to describe directions, points in time in the song, etc.
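One way to read "encode images in low resolution" is to downsample a grayscale grid and map each cell to a character, so the result fits in a text prompt. This is only a sketch under that assumption; `to_ascii` and the character ramp are made up for illustration, and nothing here is sent to a model.

```python
def to_ascii(pixels, block=2, ramp=" .:*#"):
    """Average block x block cells of a 0-255 grayscale grid into characters."""
    height, width = len(pixels), len(pixels[0])
    rows = []
    for y in range(0, height, block):
        row = ""
        for x in range(0, width, block):
            cell = [pixels[j][i]
                    for j in range(y, min(y + block, height))
                    for i in range(x, min(x + block, width))]
            mean = sum(cell) / len(cell)
            # Darker cells map to the start of the ramp, brighter to the end.
            row += ramp[min(int(mean * len(ramp) / 256), len(ramp) - 1)]
        rows.append(row)
    return "\n".join(rows)

# A 4x4 checkerboard of 2x2 tiles shrinks to a 2x2 character grid.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
art = to_ascii(image)
print(art)
```

The same block-averaging idea would apply to a 1-D list of audio amplitudes.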

These LLMs can spit out an ASCII image of text, or a different language, or code, etc. They understand representation versus an object.



