introduction

basic structure

the agent’s job
select action to maximize cumulative reward
models
- [[Multi-armed Bandit]] simplest
- [[Markov Decision Process]]

key feature of reinforcement learning

no teacher, no labels (not [[supervised learning]])
trial-and-error search
possibility of delayed reward
the need to trade-off: explore and exploit

Review: Probaility and Random Variable

properties of probability

$$P(⋃{i=1}^n)=∑{i=1}^nP(A_i)$$

$$P(A∪B)=P(A)+P(B)−P(A∩B)$$

expectation $$E(X)=∑ixi⋅P(X=x_i)$$ $$E(X)=∫{−∞}^∞xf(x)dx$$

variance $$Var(X)=E(X^2)−(E(X))2$$

cumulative distribution function $$F_X(x)=P(X≤x)$$

Review: Poisson and Exponential Distribution

Review: Independence, Correlation, Conditional Probability

Uncorrelated 안에 independent 가 있다.

x	y	p
0	0	0.25
1	1	0.25
1	1	0.25
2	0	0.25

E(X)=0×0.25+1×0.25+1×0.25+2×0.25=1
E(Y)=0×0.25+1×0.25+1×0.25+0×0.25=0.5
E(XY)=0/4+1/4+1/4+0/4=0.5=E(X)E(Y)
P(X=0) = 0.25 P(X=1) = 0.5 P(X=2) = 0.25
P(Y=0) = 0.25 + 0.25 = 0.5 P(Y=1) = 0.5
P(X=0, Y=0) = 0.25 ≠ 0.25×0.5
P(X=1, Y=1) = 0.25 + 0.25 = 0.5 (두 번 나옴) ≠ 0.5×0.5
P(X=2, Y=0) = 0.25 ≠ 0.25×0.5

Multi Armed Bandit

Bandits (또는 Multi-Arm Bandits, MAB)

Multi-Arm Bandits 문제는 이름에서 알 수 있듯이 슬롯 머신의 여러 팔을 가진 것처럼 상상될 수 있습니다. 각 팔은 다른 보상 분포를 가지며, 플레이어는 어떤 팔을 당겨서 최대의 보상을 얻을지 결정해야 합니다.

주요 개념

Arm: 선택할 수 있는 각각의 옵션입니다.
보상: 선택한 옵션(arm)에 따라 주어지는 값. 이 보상은 확률적입니다.
목표: 최대의 총 보상을 얻기 위해 어떤 arm을 선택할지 결정하는 것입니다.

탐색 vs 활용

이 문제의 핵심 도전 과제는 ‘탐색’과 ‘활용’ 사이의 균형을 찾는 것입니다:

탐색 (Exploration): 덜 알려진 arm을 선택하여 정보를 얻는 것입니다.
활용 (Exploitation): 지금까지의 정보를 바탕으로 가장 높은 보상을 줄 것으로 예상되는 arm을 선택하는 것입니다.

MAB 문제는 다양한 실제 상황에 적용될 수 있습니다. 예를 들어, 온라인 광고에서 어떤 광고가 사용자에게 효과적인지 판단하기 위해 MAB를 사용할 수 있습니다.

Multi-Arm Bandits (MAB)

1. 개요

Multi-Arm Bandits (MAB) 문제는 여러 선택 옵션(또는 “arms”) 각각이 다른 보상 분포를 가지며, 이를 통해 최대의 총 보상을 얻고자 하는 최적화 문제입니다.

2. 문제 정의

Arm: 선택할 수 있는 옵션. 각 arm ( i )는 보상 ( r_i )을 가집니다.
보상: arm을 선택할 때마다 주어지는 값. 일반적으로 확률 분포에 의해 결정됩니다.
목표: 최대의 총 보상을 얻는 전략을 찾는 것.

3. 탐색 vs 활용

탐색 (Exploration): 아직 충분히 시도해보지 않은 arm을 선택하여 더 많은 정보를 얻는 것.
활용 (Exploitation): 현재까지의 정보를 기반으로 가장 높은 보상을 제공할 것으로 예상되는 arm을 선택하는 것.

4. 알고리즘

ε-greedy:
- 확률 ( \epsilon )로 무작위 arm을 선택하고, 확률 ( 1 – \epsilon )로 지금까지의 정보를 바탕으로 가장 높은 평균 보상을 제공한 arm을 선택합니다.
- 수식:
  $$
  A_t =
  \begin{cases}
  \text{argmax}_a Q(a) & \text{with probability } 1 – \epsilon \
  \text{random arm} & \text{with probability } \epsilon
  \end{cases}
  $$
UCB (Upper Confidence Bound):
- 각 arm의 평균 보상과 해당 arm을 선택한 횟수를 고려하여 선택합니다. 덜 선택된 arm에 대한 불확실성을 반영하여 그 arm을 선택할 확률을 높입니다.
- 수식:
  $$
  A_t = \text{argmax}_a \left( Q(a) + \sqrt{\frac{2 \ln t}{N(a)}} \right)
  $$
  여기서, ( Q(a) )는 arm ( a )의 평균 보상, ( N(a) )는 시간 ( t )까지 arm ( a )가 선택된 횟수입니다.
Thompson Sampling:
- 각 arm의 보상에 대한 베이지안 추정을 기반으로 확률적으로 arm을 선택합니다. 각 arm의 성공 및 실패 횟수에 따라 베타 분포를 업데이트하고, 이 분포에서 무작위로 샘플링하여 arm을 선택합니다.

Action-value Methods

Estimating action-values
- one simple way
  sum of rewards when a taken upto t/ number of times a taken upto t

ε-greedy

limitation of ε-greedy

when the agent explores it selects equally amongst all actions
some actions are really bad; it will keep selecting them

#######################################################################

# Copyright (C) #

# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #

# 2016 Tian Jun(tianjun.cpp@gmail.com) #

# 2016 Artem Oboturov(oboturov@gmail.com) #

# 2016 Kenta Shimada(hyperkentakun@gmail.com) #

# Permission given to modify the code as long as you keep this #

# declaration at the top #

#######################################################################

import matplotlib

import matplotlib.pyplot as plt

import numpy as np

from tqdm import trange



# matplotlib.use('Agg')

%matplotlib auto




class Bandit:

# @k_arm: # of arms

# @epsilon: probability for exploration in epsilon-greedy algorithm

# @initial: initial estimation for each action

# @step_size: constant step size for updating estimations

# @sample_averages: if True, use sample averages to update estimations instead of constant step size

# @UCB_param: if not None, use UCB algorithm to select action

# @gradient: if True, use gradient based bandit algorithm

# @gradient_baseline: if True, use average reward as baseline for gradient based bandit algorithm

def __init__(self, k_arm=10, epsilon=0., initial=0., step_size=0.1, sample_averages=False, UCB_param=None,

gradient=False, gradient_baseline=False, true_reward=0.):

self.k = k_arm

self.step_size = step_size

self.sample_averages = sample_averages

self.indices = np.arange(self.k)

self.time = 0

self.UCB_param = UCB_param

self.gradient = gradient

self.gradient_baseline = gradient_baseline

self.average_reward = 0

self.true_reward = true_reward

self.epsilon = epsilon

self.initial = initial



def reset(self):

# real reward for each action

self.q_true = np.random.randn(self.k) + self.true_reward



# estimation for each action

self.q_estimation = np.zeros(self.k) + self.initial



# # of chosen times for each action

self.action_count = np.zeros(self.k)



self.best_action = np.argmax(self.q_true)



self.time = 0



# get an action for this bandit

def act(self):

if np.random.rand() < self.epsilon:

return np.random.choice(self.indices)



if self.UCB_param is not None:

UCB_estimation = self.q_estimation + \

self.UCB_param * np.sqrt(np.log(self.time + 1) / (self.action_count + 1e-5))

q_best = np.max(UCB_estimation)

return np.random.choice(np.where(UCB_estimation == q_best)[0])



if self.gradient:

exp_est = np.exp(self.q_estimation)

self.action_prob = exp_est / np.sum(exp_est)

return np.random.choice(self.indices, p=self.action_prob)



q_best = np.max(self.q_estimation)

return np.random.choice(np.where(self.q_estimation == q_best)[0])



# take an action, update estimation for this action

def step(self, action):

# generate the reward under N(real reward, 1)

reward = np.random.randn() + self.q_true[action]

self.time += 1

self.action_count[action] += 1

self.average_reward += (reward - self.average_reward) / self.time



if self.sample_averages:

# update estimation using sample averages

self.q_estimation[action] += (reward - self.q_estimation[action]) / self.action_count[action]

elif self.gradient:

one_hot = np.zeros(self.k)

one_hot[action] = 1

if self.gradient_baseline:

baseline = self.average_reward

else:

baseline = 0

self.q_estimation += self.step_size * (reward - baseline) * (one_hot - self.action_prob)

else:

# update estimation with constant step size

self.q_estimation[action] += self.step_size * (reward - self.q_estimation[action])

return reward




def simulate(runs, time, bandits):

rewards = np.zeros((len(bandits), runs, time))

best_action_counts = np.zeros(rewards.shape)

for i, bandit in enumerate(bandits):

for r in trange(runs):

bandit.reset()

for t in range(time):

action = bandit.act()

reward = bandit.step(action)

rewards[i, r, t] = reward

if action == bandit.best_action:

best_action_counts[i, r, t] = 1

mean_best_action_counts = best_action_counts.mean(axis=1)

mean_rewards = rewards.mean(axis=1)

return mean_best_action_counts, mean_rewards




def figure_2_1():

plt.violinplot(dataset=np.random.randn(200, 10) + np.random.randn(10))

plt.xlabel("Action")

plt.ylabel("Reward distribution")

plt.savefig('../images/figure_2_1.png')

plt.close()




def figure_2_2(runs=2000, time=1000):

epsilons = [0, 0.1, 0.01]

bandits = [Bandit(epsilon=eps, sample_averages=True) for eps in epsilons]

best_action_counts, rewards = simulate(runs, time, bandits)



plt.figure(figsize=(10, 20))



plt.subplot(2, 1, 1)

for eps, rewards in zip(epsilons, rewards):

plt.plot(rewards, label='$\epsilon = %.02f$' % (eps))

plt.xlabel('steps')

plt.ylabel('average reward')

plt.legend()



plt.subplot(2, 1, 2)

for eps, counts in zip(epsilons, best_action_counts):

plt.plot(counts, label='$\epsilon = %.02f$' % (eps))

plt.xlabel('steps')

plt.ylabel('% optimal action')

plt.legend()



plt.savefig('../images/figure_2_2.png')

plt.close()




def figure_2_3(runs=2000, time=1000):

bandits = []

bandits.append(Bandit(epsilon=0, initial=5, step_size=0.1))

bandits.append(Bandit(epsilon=0.1, initial=0, step_size=0.1))

best_action_counts, _ = simulate(runs, time, bandits)



plt.plot(best_action_counts[0], label='$\epsilon = 0, q = 5$')

plt.plot(best_action_counts[1], label='$\epsilon = 0.1, q = 0$')

plt.xlabel('Steps')

plt.ylabel('% optimal action')

plt.legend()



plt.savefig('../images/figure_2_3.png')

plt.close()




def figure_2_4(runs=2000, time=1000):

bandits = []

bandits.append(Bandit(epsilon=0, UCB_param=2, sample_averages=True))

bandits.append(Bandit(epsilon=0.1, sample_averages=True))

_, average_rewards = simulate(runs, time, bandits)



plt.plot(average_rewards[0], label='UCB $c = 2$')

plt.plot(average_rewards[1], label='epsilon greedy $\epsilon = 0.1$')

plt.xlabel('Steps')

plt.ylabel('Average reward')

plt.legend()



plt.savefig('../images/figure_2_4.png')

plt.close()




def figure_2_5(runs=2000, time=1000):

bandits = []

bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=True, true_reward=4))

bandits.append(Bandit(gradient=True, step_size=0.1, gradient_baseline=False, true_reward=4))

bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=True, true_reward=4))

bandits.append(Bandit(gradient=True, step_size=0.4, gradient_baseline=False, true_reward=4))

best_action_counts, _ = simulate(runs, time, bandits)

labels = [r'$\alpha = 0.1$, with baseline',

r'$\alpha = 0.1$, without baseline',

r'$\alpha = 0.4$, with baseline',

r'$\alpha = 0.4$, without baseline']



for i in range(len(bandits)):

plt.plot(best_action_counts[i], label=labels[i])

plt.xlabel('Steps')

plt.ylabel('% Optimal action')

plt.legend()



plt.savefig('../images/figure_2_5.png')

plt.close()




def figure_2_6(runs=2000, time=1000):

labels = ['epsilon-greedy', 'gradient bandit',

'UCB', 'optimistic initialization']

generators = [lambda epsilon: Bandit(epsilon=epsilon, sample_averages=True),

lambda alpha: Bandit(gradient=True, step_size=alpha, gradient_baseline=True),

lambda coef: Bandit(epsilon=0, UCB_param=coef, sample_averages=True),

lambda initial: Bandit(epsilon=0, initial=initial, step_size=0.1)]

parameters = [np.arange(-7, -1, dtype=np.float32),

np.arange(-5, 2, dtype=np.float32),

np.arange(-4, 3, dtype=np.float32),

np.arange(-2, 3, dtype=np.float32)]



bandits = []

for generator, parameter in zip(generators, parameters):

for param in parameter:

bandits.append(generator(pow(2, param)))



_, average_rewards = simulate(runs, time, bandits)

rewards = np.mean(average_rewards, axis=1)



i = 0

for label, parameter in zip(labels, parameters):

l = len(parameter)

plt.plot(parameter, rewards[i:i+l], label=label)

i += l

plt.xlabel('Parameter($2^x$)')

plt.ylabel('Average reward')

plt.legend()



plt.savefig('../images/figure_2_6.png')

plt.close()




if __name__ == '__main__':

figure_2_1()

figure_2_2()

figure_2_3()

figure_2_4()

figure_2_5()
figure_2_6()

The 10-armed Testbed

figure 2.1

results of first experiments
figure 2.2

epsilon 설정에 따른 성능 변화

the life of a bandit agent
1. select an action (arm) At
2. world sends back a reward (payoff) Rt for action At
3. update estimate of Q(At)
4. goto 1

Incremental Implementation

Trick #1 :

this has a standard form used over and over in RL
estimate = old est. + step-size*[target-old est.]
where the step-size is 1/n
Q1 is of initial estimate (guess) at the arm’s value

핵심 개념:

Incremental Estimation:
- 계산량과 메모리 사용을 각 스텝마다 일정하게 유지하려는 방법입니다. 이 접근법은 강화 학습(Reinforcement Learning, RL)에서 자주 사용되는 주제입니다.
Single Action에 대한 고려:
- 여기서는 단일 행동에 대한 가치 추정을 중점적으로 다룹니다.
Sum과 Counter의 증가:
- 초기의 기본적인 추정 방법은 각 행동에 대한 보상의 합계를 유지하고, 해당 행동이 취해진 횟수로 나누는 것입니다.
- 수식:
  $$ Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i $$
  여기서 ( R_i )는 i번째 선택에서 얻은 보상을 나타냅니다.
Incremental Learning Rule:
- 보다 효율적인 방법은 증분 학습 규칙을 사용하는 것입니다.
- 이 규칙은 현재 추정치 ( Q_n )에 새로운 보상 ( R_n )과의 차이를 조정값으로 사용하여 업데이트합니다.
- 수식:
  $$ Q_{n+1} = Q_n + \frac{1}{n} [R_n – Q_n] $$
  이 수식은 새로운 보상과 현재 추정치 사이의 차이를 이용하여 추정치를 조정하므로, 모든 보상의 합계를 유지할 필요가 없습니다.

designed to be robust to small fluctuations

Tracking a Nonstationary Problem

Trick #2:

non- stationary

시계열 데이터는 시간의 흐름에 따라 평균이나 분산 등의 통계적 특성이 변하지 않고, 일정한 추세가 없는 정상성(Stationary) 데이터와 시간에 따라 통계적 특성이 변화하는 비정상성(Non-Stationary) 데이터로 나눌 수 있습니다.

in such tasks the sample (long-run) average is not a good idea.

$$
Q_{n+1} = (1 – \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 – \alpha)^{n-i} R_i
$$

Optimistic Initial Values

figure 2.3

Optimistic Initial Values

Trick #3 :

the bias of Q1 can be used to our advantage
until now Q1(a) =0 for all a
however, we could set Q1(a) higher than we think q*(a) actually is, or optimistically

Bias of 𝑄1: 여기서 𝑄1은 어떤 행동 𝑎에 대한 예상 보상 값을 나타냅니다. 이 값은 에이전트가 초기에 어떤 행동을 선택할 때 얼마나 그 행동이 좋다고 생각하는지를 나타냅니다.
Until now 𝑄1 𝑎 = 0 for all 𝑎: 초기 상태에서 모든 행동에 대한 예상 보상 값은 0이라고 설정되었다는 것을 나타냅니다.
Optimistic Initialization: 이 트릭의 핵심 아이디어는 모든 행동을 초기에는 ‘좋다’고 가정하는 것입니다. 즉, 𝑄1 값을 그 행동의 실제 예상 보상 값보다 더 높게 설정합니다.
Exploration: 강화학습에서 에이전트는 환경을 탐험(Exploration)하면서 최적의 행동을 찾아나갑니다. 만약 에이전트가 모든 행동이 좋다고 초기에 생각한다면, 그는 모든 가능한 행동을 시도해볼 것입니다. 이러한 방식으로, 에이전트는 더 많은 행동을 탐험하게 됩니다.
Repeatedly Disappointed: 에이전트가 각 행동을 시도하면서 실제 보상을 얻게 되면, 초기에 설정한 ‘낙관적인’ 예상 보상 값과 실제 보상 값 사이에 차이를 발견하게 됩니다. 그 결과, 에이전트는 그 차이를 줄이기 위해 자신의 예상 보상 값을 조정하게 됩니다.
Optimism in the Face of Uncertainty: 이 트릭의 이름으로, 에이전트가 처음에는 환경에 대한 정보가 없기 때문에 모든 행동에 낙관적으로 접근하는 것을 의미합니다.

요약하면, 초기화 트릭은 에이전트가 더 많은 탐험을 하게 만들어, 최적의 행동을 더 빨리 찾아낼 수 있게 도와줍니다. 이는 특히 초기에 에이전트가 환경에 대한 정보가 거의 없을 때 유용합니다.

Upper Confidence Bound Action Selection

수식 :
$$
A_t = \text{argmax}_a \left( Q_t(a) + \sqrt{\frac{ \log t}{N_t(a)}} \right)
$$
Nt : number of times a has been selected
c>0 is the exploration parameter
when Qt’s for different actions are similar, we will favor actions that have not been selected often

figure 2.4

Reinforcement learning: an introduction [1] Multi Armed Bandit