Explore and Exploit

Life is like a sparse-reward problem in Reinforcement Learning (RL). And just like in RL, you are not provided with the transition model. You are left only with the option to sample some rewards through your actions and learn from them.

Before exploring this line of thought, we will first look at skill (or education) debt. It is a situation where you somehow get away with not learning something important, but later have to pay it back in the currency of time or effort. An immediate example is the skill of communication: we get away with not learning it in our school days, but later in life we have to put in extra effort to learn it, especially once we are expected to already know it. There is a well-known, similar term in the tech industry called “technical debt”, which is the cost of additional rework caused by choosing an easy immediate solution instead of a better, far-sighted approach or a tough choice.

So how can we avoid skill debt? We are not handed a perfect model of life, work and the world, so we can only rely on the samples we already have or on those we can get by performing some actions.

It is very common to blame schools and universities for skill debt, feeling that they are responsible for not teaching us the skills that matter in our lives. For example, we often hear opinions like school doesn’t teach us to file taxes, or it only teaches us to memorize and not to think critically. But is it really on the system? Or is it on us? Can’t we learn to file taxes based on our past experiences and general learning ability? For most people and ordinary situations, this works well. Of course, more complex or unusual cases will need more expertise, but even then there will always be someone who has studied the rules formally and can handle them in an expert way.

In the case of memorisation, the problem is ours: we frequently took shortcuts and got away with it, and we did not allow ourselves to fail and explore. We chose a sub-optimal path that worked for everyone else and avoided the risk of the tiny embarrassments of failure. While schools are partially responsible for not creating an environment that encourages exploration, they are not the only ones to blame. For instance, I avoided every opportunity to speak in class, to get on stage, and to disagree with instructors or peers. We (not all of us) got away with it, and now, or at some point in our lives, we have to pay the skill debt of communication. School provided us with some opportunities, but the environment was not very conducive to exploration. Even then, ultimately, we lacked the courage to get past that embarrassment-versus-learning barrier. Now, at this point in life, the stakes are so high that we either don’t try, or if we try, we lack the skills and hate the situation.

So far we have seen that exploring is very important for our learning. The second part is to exploit the skills we have. Once we have explored enough and learned some skills, we get the advantage of Skill-Begets-Skill: the idea that once you have learned some skills, it becomes easier to learn new ones. Not only that, like the Matthew Effect in sociology, the more you have, the more you get. You become more articulate (which literally means “jointed”), and you get more opportunities to apply those skills. The more you apply them, the more samples you get, and as a result you get a more precise model of the world or of the area you are skilled in.

To make it concrete, think of yourself as a good chemistry student. Your classmates may ask you for help in chemistry, provided you are also helpful; yes, being a good human is not optional for this whole setup. By helping them you may come across new questions that are beyond your current reach, and these become anchors for you to explore and learn more.

Frozen Lake Environment

Frozen Lake is a reinforcement learning environment where an agent learns to navigate from a starting point (S) to a goal (G) on a grid of frozen tiles, while avoiding holes (H).

Frozen Lake

The movements are stochastic, which means the agent does not always move exactly in the direction you want it to move, just like in the real world. Each action therefore has a probability distribution over possible outcomes. For example, if the agent chooses to move right, it might indeed move right with high probability, or slip and move up or down with smaller probabilities.
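To make the idea of a distribution over outcomes concrete, here is a minimal sketch in plain Python. The probabilities (0.8 for the intended direction, 0.1 for each sideways slip) and the helper names `outcome_distribution` and `sample_direction` are assumptions chosen for illustration; real Frozen Lake implementations may use different numbers.

```python
import random

# Directions at right angles to each intended move (where a slip can send you).
PERPENDICULAR = {
    "left": ("up", "down"),
    "right": ("up", "down"),
    "up": ("left", "right"),
    "down": ("left", "right"),
}


def outcome_distribution(action, p_intended=0.8):
    """Return {actual_direction: probability} for a chosen action.

    p_intended is an assumed value, not taken from any specific library.
    """
    slip = (1.0 - p_intended) / 2
    side_a, side_b = PERPENDICULAR[action]
    return {action: p_intended, side_a: slip, side_b: slip}


def sample_direction(action, rng, p_intended=0.8):
    """Sample the direction the agent actually moves in."""
    dist = outcome_distribution(action, p_intended)
    directions = list(dist)
    weights = [dist[d] for d in directions]
    return rng.choices(directions, weights=weights, k=1)[0]


rng = random.Random(0)
dist = outcome_distribution("right")
samples = [sample_direction("right", rng) for _ in range(10_000)]
observed = {d: samples.count(d) / len(samples) for d in dist}
print("assumed distribution:", dist)
print("observed frequencies:", observed)
```

With enough samples, the observed frequencies should settle close to the assumed distribution, which is all an agent without the model can ever see.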

Sounds familiar? Even if you take the best actions, you are not guaranteed the best outcome; we call it luck.

Each step gives zero reward, falling into a hole ends the episode in failure, and reaching the goal yields a positive reward. Due to this sparse reward structure, the agent must balance exploration and exploitation to learn an optimal policy (a strategy for choosing actions). In this environment we know the transition probabilities, but in most RL problems, and in real life, we do not have access to the transition model. We can only sample from it by taking actions and observing the outcomes.
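The reward structure described above can be sketched as a tiny simulator. Everything specific here is an assumption for illustration: the 4x4 layout, the 0.8 probability of moving as intended, the reward of 1.0 at the goal, and the `step` and `run_episode` helpers. The agent follows a uniformly random policy, so the sketch only shows how rarely a reward is seen, not how to learn from it.

```python
import random

# Assumed 4x4 layout: S = start, F = frozen tile, H = hole, G = goal.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
SIZE = len(GRID)
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up
P_INTENDED = 0.8  # assumed probability of moving in the chosen direction


def step(state, action, rng):
    """Sample one transition; returns (next_state, reward, done)."""
    if rng.random() >= P_INTENDED:
        # Slip: turn 90 degrees to one side of the intended direction.
        action = (action + rng.choice((1, 3))) % 4
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), SIZE - 1)  # the grid edge blocks movement
    col = min(max(state[1] + d_col, 0), SIZE - 1)
    tile = GRID[row][col]
    reward = 1.0 if tile == "G" else 0.0  # the only non-zero reward
    return (row, col), reward, tile in "HG"


def run_episode(rng, max_steps=100):
    """Follow a uniformly random policy and return the episode's total reward."""
    state, total = (0, 0), 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, rng.randrange(4), rng)
        total += reward
        if done:
            break
    return total


rng = random.Random(0)
returns = [run_episode(rng) for _ in range(2000)]
print("fraction of episodes that reached the goal:", sum(returns) / len(returns))
```

Every episode ends with a total reward of either 0 or 1, and the agent gets no hint along the way about which of its moves mattered.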

Abuse of the word skill

The word skill is popularly misused to refer to very short-term or narrowly defined abilities, whereas it should refer to longer-term underlying abilities. For example, we often hear that someone is good at coding, but coding is one of the tools; the real skill is problem solving using that tool. This is not a very good example, as it requires more nuance: coding becomes a skill when you are so good at it that you can solve problems or debug code very efficiently, and that too in a particular language. But just knowing the syntax and semantics of a programming language is not a skill, it is just a tool. The real skill is using that tool to solve problems.

Now that it is very easy to build things using tools, it is also very common to get away with not learning the underlying skill. This skill debt leads to an illusion of competence, where you can build things and are overly confident about your ability to build things, but you are not really solving problems. It is the same situation as looking at the solution to a problem and thinking you understand it, but then totally blanking out when you try to solve a similar problem on the exam or, even worse, when your crush asks for help with one.

Compare it with being a mechanic or an engineer. You can be a good mechanic and repair the car, but you are not really an engineer until you understand how the car works and can modify it to make it better. To add the nuance, a very good mechanic can also be a very good engineer, but not all mechanics are engineers. A good mechanic with a deep understanding of the car will be more pragmatic and efficient in repairing it, and may even be able to innovate and improve its design given the opportunity. Ultimately, the point is that a person who has a better model of how the car, or whatever their domain is, works will be more efficient, regardless of how they acquired that model.

Authentic choices

Making authentic choices now matters more than the immediate risk of failure or the greed for sub-optimal success. Short-term losses are not bad; as long as you do not incur very big setbacks, they will pay off in the long run. If you want to take that course, even if it may harm your GPA by some points, you should take it if you think it will open up opportunities in the future. Judgement, the ability to believe in your decisions, and learning how to learn are far more important than any single concept you will ever learn. And this judgement is developed only by making authentic choices yourself.

Learning requires piqued curiosity

Necessity is the mother of education. Rather than fighting the system and blaming it, understand it and maybe exploit it. If you are intrigued by, say, the marquee effect in HTML, try using it and experimenting with it. You may discover related concepts and gain better first-hand experience with the tool, even if you never actually use it again or if it won’t fetch you grades or a job. Try to do things rather than just thinking about doing things, and do not be cocky just because you think differently. Try it, even if it is boring. If not an expert, just be average at most of the concepts people (or school) expect you to know, and an expert at what you love.

You have limited time, so figure out what you love. If you love movies, do your homework on them: figure out how they are made and who makes them. What do you like about them? Do you notice the lenses they use? Do you like a particular genre or director? Be interested. Should everyone watching movies know these things? No, but you said you love movies. What is special about you? If you are not different from others, then you are just consuming. It should have some impact on you; if not, then it is just another form of doomscrolling, only in long format.

Do not get me wrong, I am not saying everything should be commodified. The point is that you cannot blame everyone else for teaching you boring things unless you yourself already have something more useful and interesting to pursue. And even then, most of the things you learn in school may seem useless to you but are useful for someone else in your own class.

Takeaway

I will now expand on the first point, that life is like a sparse-reward problem in Reinforcement Learning. In sparse-reward problems, the agent receives rewards only after a long sequence of actions, and it is not clear which actions led to the reward. Similarly, in life, we do not receive immediate feedback, and it is very difficult to attribute our successes and failures to specific actions. It is only in hindsight that we can look back and analyze the sequence of actions that led to a particular outcome.

Thus, it is better to be \(\epsilon\)-greedy in life: we explore new opportunities with a small probability \(\epsilon\), and exploit our existing skills and knowledge with probability \(1-\epsilon\). In this way, we can balance the trade-off between exploration and exploitation, and understand the model better and faster. This model will then help us make better decisions in the future. What is the optimal \(\epsilon\)? It is something we must figure out for ourselves based on our circumstances and goals. In other words, \(\epsilon\) can only be determined by taking exploratory actions and continuously reflecting on the outcomes.
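As a rough illustration of the \(\epsilon\)-greedy rule, here is a minimal sketch on a three-option choice problem (a multi-armed bandit, swapped in because it is simpler than Frozen Lake). The payoff probabilities in `true_means`, the step count, and the helper names `epsilon_greedy` and `run_bandit` are all hypothetical choices made for this example.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Explore a random option with probability epsilon, else pick the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    # Ties go to the lowest index, because max() returns the first maximum.
    return max(range(len(estimates)), key=lambda a: estimates[a])


def run_bandit(true_means, epsilon, steps, seed=0):
    """Return (estimates, counts, average_reward) after `steps` choices."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # current belief about each option
    counts = [0] * len(true_means)       # how often each option was tried
    total = 0.0
    for _ in range(steps):
        arm = epsilon_greedy(estimates, epsilon, rng)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental average: nudge the estimate toward the new sample.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total / steps


true_means = [0.2, 0.5, 0.8]  # hypothetical payoff probabilities, unknown to the agent
for eps in (0.0, 0.1, 0.5):
    estimates, counts, average = run_bandit(true_means, eps, steps=5000)
    print(f"epsilon={eps}: tries per option={counts}, average reward={average:.3f}")
```

With \(\epsilon = 0\) this agent never leaves the first option it tried, because it has no reason to believe anything else is better. A small positive \(\epsilon\) lets it discover the better options, while a large \(\epsilon\) keeps spending choices on options it already knows are worse.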