Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article, the multi-armed bandit problem and a few algorithms to solve it are discussed. The problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft, and the problem description is taken from the assignment itself.

The Problem Statement and Some Theory

Given a set of actions, each with an unknown reward distribution, the goal is to maximize the cumulative reward by taking the actions sequentially, one action at each time step, obtaining a reward immediately after each action. This is the classical explore-exploit dilemma in reinforcement learning: in order to find the optimal action, one needs to explore all the actions, but not too much; at the same time, one needs to exploit the best action found so far by exploring.

The following figure defines the problem mathematically and illustrates the exploration-exploitation dilemma in the general setting of reinforcement learning problems with sequential decisions under incomplete information.

[Figure: mathematical formulation of the multi-armed bandit problem and the exploration-exploitation dilemma]
The following figure shows a motivating application of the multi-armed bandit problem in drug discovery. Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied to a sequence of patients, with a reward of 1 if a patient survives after the application of the drug and 0 if the patient dies, the goal is to save as many patients as we can.

[Figure: the multi-armed bandit framework applied to sequential drug trials]
The following figures show the naive round-robin algorithm and a few variants of the greedy algorithm for maximizing the cumulative reward.

  • The naive round-robin algorithm simply chooses every action once to complete a round and then repeats the rounds. Obviously it is far from optimal, since it explores too much and exploits too little.
  • The greedy algorithm always chooses the arm with the maximum average reward observed so far, with the drawback that it may lock on to a sub-optimal action forever.
  • The epsilon-greedy and optimistic-greedy algorithms are variants of the greedy algorithm that try to recover from this drawback. Epsilon-greedy chooses an action uniformly at random with probability epsilon (and the greedy action otherwise), whereas the optimistic-greedy algorithm initializes the estimated reward of every action to a high value R, in order to prevent it from locking on to a sub-optimal action.

[Figure: the round-robin, greedy, epsilon-greedy and optimistic-greedy algorithms]

The following code, taken from the github repository of the same course, shows how the basic bandit framework (the environment and the policy interface) can be defined:

import numpy as np
import sys

### Interface
class Environment(object):

    def reset(self):
        raise NotImplementedError('Inheriting classes must override reset.')

    def actions(self):
        raise NotImplementedError('Inheriting classes must override actions.')

    def step(self):
        raise NotImplementedError('Inheriting classes must override step.')

class ActionSpace(object):

    def __init__(self, actions):
        self.actions = actions
        self.n = len(actions)

### BanditEnv Environment

class BanditEnv(Environment):

    def __init__(self, num_actions=10, distribution="bernoulli", evaluation_seed=387):
        super(BanditEnv, self).__init__()

        self.action_space = ActionSpace(range(num_actions))
        self.distribution = distribution

        np.random.seed(evaluation_seed)

        self.reward_parameters = None
        if distribution == "bernoulli":
            self.reward_parameters = np.random.rand(num_actions)
        elif distribution == "normal":
            self.reward_parameters = (np.random.randn(num_actions),
                                      np.random.rand(num_actions))
        elif distribution == "heavy-tail":
            self.reward_parameters = np.random.rand(num_actions)
        else:
            print("Please use a supported reward distribution")
            sys.exit(0)

        if distribution != "normal":
            self.optimal_arm = np.argmax(self.reward_parameters)
        else:
            self.optimal_arm = np.argmax(self.reward_parameters[0])

    def reset(self):
        self.is_reset = True
        return None

    def compute_gap(self, action):
        if self.distribution != "normal":
            gap = np.absolute(self.reward_parameters[self.optimal_arm] -
                              self.reward_parameters[action])
        else:
            gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
                              self.reward_parameters[0][action])
        return gap

    def step(self, action):
        self.is_reset = False

        valid_action = True
        if (action is None or action < 0 or action >= self.action_space.n):
            print("Algorithm chose an invalid action; reset reward to -inf")
            reward = float("-inf")
            gap = float("inf")
            valid_action = False

        if self.distribution == "bernoulli":
            if valid_action:
                reward = np.random.binomial(1, self.reward_parameters[action])
                gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action]
        elif self.distribution == "normal":
            if valid_action:
                reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn()
                gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
        elif self.distribution == "heavy-tail":
            if valid_action:
                reward = self.reward_parameters[action] + np.random.standard_cauchy()
                gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action]  # HACK to compute expected gap
        else:
            print("Please use a supported reward distribution")
            sys.exit(0)

        return None, reward, self.is_reset, ''

# Policy interface
class Policy:
    # num_actions: (int) number of arms, indexed by 0 ... num_actions-1
    def __init__(self, num_actions):
        self.num_actions = num_actions

    def act(self):
        pass

    def feedback(self, action, reward):
        pass

Now, in order to implement an algorithm, we just need to extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).
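For example, here is a minimal sketch (my own, not the course's reference solution) of the naive round-robin and epsilon-greedy policies on top of this interface; note that epsilon = 0 recovers the plain greedy algorithm:

import numpy as np

class RoundRobin(Policy):
    # Naive baseline: cycle through the arms 0, 1, ..., num_actions-1 forever.
    def __init__(self, num_actions):
        super().__init__(num_actions)
        self.t = 0

    def act(self):
        action = self.t % self.num_actions
        self.t += 1
        return action

class EpsilonGreedy(Policy):
    # epsilon = 0 gives the plain greedy policy; epsilon = 1 acts uniformly at random.
    def __init__(self, num_actions, epsilon=0.1):
        super().__init__(num_actions)
        self.epsilon = epsilon
        self.counts = np.zeros(num_actions)     # number of pulls per arm
        self.estimates = np.zeros(num_actions)  # running average reward per arm

    def act(self):
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions)  # explore
        return int(np.argmax(self.estimates))           # exploit

    def feedback(self, action, reward):
        # incremental update of the sample mean for the pulled arm
        self.counts[action] += 1
        self.estimates[action] += (reward - self.estimates[action]) / self.counts[action]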

In order to theoretically analyze the greedy algorithms and to find algorithms with better performance guarantees, let's define the regret as the gap between the total expected reward obtained by always choosing the optimal action (assuming the reward distributions are known) and the cumulative reward obtained with the actions chosen by our algorithm, as shown in the following figure. Hence, maximizing the cumulative reward is equivalent to minimizing the regret.
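In symbols, a standard way of writing the definition shown in the figure is the following, with μ* the expected reward of the optimal arm and a_t the action chosen at timestep t:

\rho(T) \;=\; T\,\mu^{*} \;-\; \sum_{t=1}^{T} \mathbb{E}\left[r(a_t)\right],
\qquad \mu^{*} \;=\; \max_{i}\, \mathbb{E}\left[r(A_i)\right]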

Given that greedy exploits too much and epsilon-greedy (with a fixed ε) explores too much, it can be shown that all of these greedy variants have regret that grows linearly in the number of timesteps T.

Also, it was proven by Lai and Robbins that the lower bound on the regret achievable by any algorithm is logarithmic in the number of timesteps T.

 

[Figure: definition of regret and regret bounds for the greedy variants]


  • The next figure shows the reward distributions for a 5-armed bandit framework.
  • If we take an action A_i, with probability p_i we get a reward of 1 and with probability 1-p_i we get a reward of 0.
  • Hence, the reward of each arm A_i is distributed as a Bernoulli random variable with some parameter p_i, i.e., R_i ~ B(p_i), i=0..4.
  • We first generate the parameters of the distribution for each arm randomly.
  • As we can see, arm 2 has the highest p_i, so choosing arm 2 gives the highest probability of a reward of 1 at any timestep.
  • Hence, arm 2 is the arm/action with the most promising reward, and an optimal policy should choose that arm at all timesteps.

 

[Figure: Bernoulli reward parameters p_i for the 5-armed bandit]

Now, let's assume that we don't know the p_i values and use the greedy and the naive round-robin algorithms to maximize the cumulative reward over 10000 timesteps.
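A minimal sketch of the experiment loop is shown below; the runner function and its names are mine (the lab provides its own runner and plotting code), and it assumes the BanditEnv, RoundRobin and EpsilonGreedy classes defined above:

def run(env, policy, num_steps=10000):
    # Play the bandit for num_steps steps and accumulate reward and (pseudo-)regret.
    total_reward, total_regret = 0.0, 0.0
    env.reset()
    for _ in range(num_steps):
        action = policy.act()
        _, reward, _, _ = env.step(action)
        policy.feedback(action, reward)
        total_reward += reward
        total_regret += env.compute_gap(action)  # expected per-step regret of this action
    return total_reward, total_regret

env = BanditEnv(num_actions=5, distribution="bernoulli")
print(run(env, EpsilonGreedy(num_actions=5, epsilon=0.0)))  # epsilon = 0 -> plain greedy
print(run(env, RoundRobin(num_actions=5)))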

As can be seen from the next animations and figures:

  1. The greedy algorithm locks on to arm/action 0 and never finds the optimal action.
  2. The round-robin algorithm chooses the actions uniformly and also never settles on the optimal action.
  3. Both greedy and round-robin have regret that grows linearly with the number of timesteps.
  4. In this case the greedy algorithm, even though it is stuck on a sub-optimal action, still performs relatively better than round-robin.

Greedy

[Animation of the arm choices and cumulative reward / regret plots for the greedy algorithm]

Round-Robin

[Animation of the arm choices and cumulative reward / regret plots for the round-robin algorithm]


 

  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we get a reward of 1 and with probability 1-p_i we get a reward of 0.
  • Hence, the reward of each arm A_i is distributed as a Bernoulli random variable with some parameter p_i, i.e., R_i ~ B(p_i), i=0..9.
  • We first generate the parameters of the distribution for each arm randomly.
  • As we can see, arm 6 has the highest p_i, so choosing arm 6 gives the highest probability of a reward of 1 at any timestep.
  • Hence, arm 6 is the arm/action with the most promising reward, and an optimal policy should choose that arm at all timesteps.

 

[Figure: Bernoulli reward parameters p_i for the 10-armed bandit]

Again, let's assume that we don't know the p_i values and use the epsilon-greedy (with different values of the hyper-parameter ε) and the optimistic-greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative reward over 10000 timesteps.
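The optimistic-greedy policy can be sketched in the same style (again my own sketch; here R is treated as one fake initial observation for every arm, which is one common way to implement optimistic initialization):

class OptimisticGreedy(Policy):
    # Greedy over optimistically initialized estimates: every arm starts with an
    # estimated reward of R (counted as one fake pull), so arms that have not yet
    # been pulled much keep looking attractive.
    def __init__(self, num_actions, R=5.0):
        super().__init__(num_actions)
        self.counts = np.ones(num_actions)               # one fake pull per arm
        self.estimates = np.full(num_actions, float(R))  # optimistic initial values

    def act(self):
        return int(np.argmax(self.estimates))

    def feedback(self, action, reward):
        self.counts[action] += 1
        self.estimates[action] += (reward - self.estimates[action]) / self.counts[action]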

As can be seen from the next animations and figures:

  1. The epsilon-greedy algorithm behaves exactly like greedy when ε = 0 and chooses actions uniformly at random when ε = 1; in both of these cases it has linear regret.
  2. With ε = 0.1 and ε = 0.15 it eventually finds the optimal arm 6 and has sub-linear regret.
  3. The optimistic-greedy algorithm behaves exactly like greedy when R = 0 and keeps exploring almost uniformly when R = 10000; in both of these cases it has linear regret.
  4. With R = 3 and R = 5 it eventually finds the optimal arm 6 and has sub-linear regret (w.r.t. timesteps).

Epsilon-Greedy

[Animations of the arm choices and cumulative reward / regret plots for ε = 0, 0.05, 0.1, 0.15 and 1.0]

 

Optimistic Greedy

[Animations of the arm choices and cumulative reward / regret plots for R = 0, 1, 3, 5 and 10000]

The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.

[Figure: the UCB1 and Thompson (Beta posterior sampling) algorithms]


  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we get a reward of 1 and with probability 1-p_i we get a reward of 0.
  • Hence, the reward of each arm A_i is distributed as a Bernoulli random variable with some parameter p_i, i.e., R_i ~ B(p_i), i=0..9.
  • We first generate the parameters of the distribution for each arm randomly.
  • As we can see, arm 4 has the highest p_i, so choosing arm 4 gives the highest probability of a reward of 1 at any timestep.
  • Hence, arm 4 is the arm/action with the most promising reward, and an optimal policy should choose that arm at all timesteps.

[Figure: Bernoulli reward parameters p_i for the 10-armed bandit]

Again, let's assume that we don't know the p_i values; we implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative reward over 10000 timesteps.
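A minimal sketch of UCB1 on top of the same Policy interface (again my own sketch, not the course's reference solution):

class UCB1(Policy):
    # Pull each arm once, then always pick the arm with the largest UCB1 index:
    # average reward + sqrt(2 * ln(t) / n_i).
    def __init__(self, num_actions):
        super().__init__(num_actions)
        self.t = 0
        self.counts = np.zeros(num_actions)
        self.estimates = np.zeros(num_actions)

    def act(self):
        self.t += 1
        if np.any(self.counts == 0):          # make sure every arm is tried once
            return int(np.argmin(self.counts))
        ucb = self.estimates + np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(ucb))

    def feedback(self, action, reward):
        self.counts[action] += 1
        self.estimates[action] += (reward - self.estimates[action]) / self.counts[action]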

As can be seen from the next animations and figures:

  1. Both algorithms find the optimal arm 4 pretty quickly, without much exploration.
  2. Both algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
  3. For posterior sampling with Thompson-Beta, each arm's reward probability is sampled from a Beta posterior, the conjugate prior of the Bernoulli likelihood, starting from a non-informative flat prior.
  4. The posterior for arm A_i is R_i ~ β(a_i, b_i), where a_i - 1 is the number of times we obtained a reward of 1 and b_i - 1 is the number of times we obtained a reward of 0 when arm A_i was drawn (starting from a_i = b_i = 1 for the flat prior), as shown in the figure; a minimal sketch follows this list.
  5. Thompson sampling with the Beta posterior gives linear regret when the (unknown) reward distribution is normal.
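And a minimal sketch of Thompson sampling with a Beta posterior for Bernoulli rewards (my own sketch; the posterior parameters are the success/failure counts plus the flat-prior pseudo-counts mentioned above):

class ThompsonBeta(Policy):
    # Keep a Beta(successes + 1, failures + 1) posterior per arm, sample one value
    # from each posterior, and pull the arm with the largest sampled value.
    def __init__(self, num_actions):
        super().__init__(num_actions)
        self.successes = np.zeros(num_actions)  # number of reward-1 outcomes per arm
        self.failures = np.zeros(num_actions)   # number of reward-0 outcomes per arm

    def act(self):
        samples = np.random.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(samples))

    def feedback(self, action, reward):
        if reward > 0:
            self.successes[action] += 1
        else:
            self.failures[action] += 1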

UCB

[Animation of the arm choices and cumulative reward / regret plots for UCB1]

Thompson Beta

[Animations of the per-arm Beta posteriors and cumulative reward / regret plots for Thompson sampling with Bernoulli rewards]

Thompson Normal

[Cumulative reward / regret plots for Thompson sampling with normally distributed rewards]