Reinforcement Learning: Playing the Classic CartPole Game with Q-Learning and SARSA, by hongtao

![img](http://ww4.sinaimg.cn/large/006tNc79gy1g4stgr2nb7j30rs0kt45l.jpg)

*Image from [unsplash.com](https://unsplash.com/photos/i3mcVZQObcU) by Ferdinand Stöhr*

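The previous posts covered how to play MountainCar (pushing the cart up the hill) with Q-learning and SARSA. This post looks at how to solve the CartPole pole-balancing game.

As before, to make it easier to discuss with readers, all the code is available here: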

https://github.com/zht007/tensorflow-practice

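### 1. Environment analysis

For an introduction to the CartPole game, see [this earlier post](https://steemit.com/cn-stem/@hongtao/dqn-q-learning); I won't repeat it here. Reading the official OpenAI documentation for [CartPole v0](https://github.com/openai/gym/wiki/CartPole-v0) shows that the biggest difference from [MountainCar-v0](https://github.com/openai/gym/wiki/MountainCar-v0) is that the CartPole state has four dimensions: position, velocity, pole angle, and angular velocity. Of these, velocity and angular velocity range from negative to positive infinity. Q-learning and SARSA both rely on a policy defined over a finite set of discrete states (the Q-table), so how do we split an unbounded, continuous state into a finite number of discrete states?

Here we can use the sigmoid function, once widely used in neural networks, which maps an unbounded range into the interval between 0 and 1. So we first define a sigmoid helper function.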

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # maps (-inf, inf) into (0, 1)
```
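As a quick sanity check on the squashing idea, here is a minimal sketch that is not part of the original notebook. The name `stable_sigmoid` and the sample velocities are my own assumptions. It computes the same function through `np.logaddexp`, because the plain form above can emit an overflow warning for very negative inputs such as the lower bound of the velocity range.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid written as exp(-log(1 + exp(-x))) so large |x| does not overflow."""
    x = np.asarray(x, dtype=np.float64)
    return np.exp(-np.logaddexp(0.0, -x))

# Hypothetical velocities, including extreme values standing in for +/- infinity.
velocities = np.array([-1e30, -2.0, 0.0, 2.0, 1e30])
squashed = stable_sigmoid(velocities)

# Every value now lies in [0, 1], and the ordering of the inputs is preserved.
print(squashed)
```

Note that the sigmoid saturates quickly, so most of the resolution is spent on velocities close to zero.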

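### 2. Building the Q-table

As with MountainCar, the continuous state has to be cut into discrete states. The difference is that velocity and angular velocity first need to be mapped into a finite range with the sigmoid function.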

```python
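# Assumes `env` (the CartPole-v0 environment) and Q_TABLE_LEN are already defined.
# One axis of Q_TABLE_LEN bins per state dimension (four for CartPole).
DISCRETE_OS_SIZE = [Q_TABLE_LEN] * (len(env.observation_space.high))

# Position and angle keep their native bounds. Velocity and angular velocity
# are unbounded, so their bounds are squashed by sigmoid and scaled by Q_TABLE_LEN.
observation_high = np.array([env.observation_space.high[0],
                    Q_TABLE_LEN*sigmoid(env.observation_space.high[1]),
                    env.observation_space.high[2],
                    Q_TABLE_LEN*sigmoid(env.observation_space.high[3])])

observation_low = np.array([env.observation_space.low[0],
                    Q_TABLE_LEN*sigmoid(env.observation_space.low[1]),
                    env.observation_space.low[2],
                    Q_TABLE_LEN*sigmoid(env.observation_space.low[3])])

# Width of one bin along each state dimension.
discrete_os_win_size = (observation_high - observation_low) / DISCRETE_OS_SIZE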
```

*Code from [github repo](https://github.com/zht007/tensorflow-practice/blob/master/10_Renforcement_Learning_Moutain_Car/1_q_learning_python_mountain_car.ipynb) with MIT license* 
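The snippet above only sets up the bin widths. A rough sketch of how a raw observation could then be turned into Q-table indices is shown below. It is my own illustration, not the notebook's code: the function name `get_discrete_state`, the hard-coded limits `POS_LIMIT` and `ANGLE_LIMIT` (used so the sketch runs without gym; in practice they would come from `env.observation_space`), and the final clipping step are all assumptions.

```python
import numpy as np

Q_TABLE_LEN = 150  # the table length used in this post

def sigmoid(x):
    # Overflow-safe form of 1 / (1 + exp(-x)).
    return np.exp(-np.logaddexp(0.0, -np.asarray(x, dtype=np.float64)))

# Assumed bounds for position and angle; take the real ones from the environment.
POS_LIMIT = 4.8
ANGLE_LIMIT = 0.418

# The squashed dimensions run from 0 to Q_TABLE_LEN, mirroring the bounds built above.
observation_low = np.array([-POS_LIMIT, 0.0, -ANGLE_LIMIT, 0.0])
observation_high = np.array([POS_LIMIT, Q_TABLE_LEN, ANGLE_LIMIT, Q_TABLE_LEN])
discrete_os_win_size = (observation_high - observation_low) / Q_TABLE_LEN

def get_discrete_state(state):
    """Map a raw 4-dimensional observation to a tuple of integer Q-table indices."""
    scaled = np.array([state[0],
                       Q_TABLE_LEN * sigmoid(state[1]),
                       state[2],
                       Q_TABLE_LEN * sigmoid(state[3])])
    idx = ((scaled - observation_low) / discrete_os_win_size).astype(int)
    # Clip so values on or beyond the bounds still land inside the table.
    return tuple(np.clip(idx, 0, Q_TABLE_LEN - 1))

print(get_discrete_state([0.1, 0.5, -0.02, -0.3]))
```

The returned tuple can be used directly as an index, for example `q_table[get_discrete_state(obs)]` yields the action values for that state.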

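Note that because the Q-table has so many dimensions, its entries are simply initialized to 0 here; otherwise, randomly generating 150 × 150 × 150 × 150 × 2 numbers (about one billion) would take a long time. I set Q_TABLE_LEN to 150 (which takes up roughly 6 GB of memory); a Q-table length that is too large will run out of memory.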

```python
q_table = np.zeros((DISCRETE_OS_SIZE + [env.action_space.n]))
```
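To get a feel for how quickly the table grows, the back-of-the-envelope calculation below may help. It is only a sketch: the helper name `q_table_bytes` is made up, and it assumes 8 bytes per entry, which is the default `float64` dtype of `np.zeros`. It gives the nominal size of the array; the memory an operating system actually reports can differ from this figure.

```python
def q_table_bytes(table_len, n_state_dims=4, n_actions=2, bytes_per_entry=8):
    """Return (number of entries, nominal size in bytes) of a dense Q-table."""
    entries = table_len ** n_state_dims * n_actions
    return entries, entries * bytes_per_entry

entries, size = q_table_bytes(150)
print(f"{entries:,} entries, {size / 1e9:.1f} GB as float64")

# Halving the entry size (float32) halves the footprint.
_, size32 = q_table_bytes(150, bytes_per_entry=4)
print(f"{size32 / 1e9:.2f} GB as float32")
```

Because the size grows with the fourth power of the table length, going from 150 to 200 bins per dimension more than triples the memory needed.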



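### 3. Q-Learning and SARSA

The rest of the code is almost identical to the MountainCar version, so I won't repeat it here; see the [previous post](https://steemit.com/cn-stem/@hongtao/q-learning-mountaincar). The results show little difference between the two algorithms, and both complete the task well.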

![image-20190708154308650](http://ww2.sinaimg.cn/large/006tNc79gy1g4stak2o2mj312m0ek78t.jpg)
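For readers who have not seen the MountainCar posts, the only place where the two algorithms differ is the update target. The sketch below shows both updates side by side on a tiny table. It is an illustration, not the notebook's code: the function names and the values of `LEARNING_RATE` and `DISCOUNT` are assumptions.

```python
import numpy as np

LEARNING_RATE = 0.1  # assumed value for illustration
DISCOUNT = 0.95      # assumed value for illustration

def q_learning_update(q_table, s, a, reward, s_next, done):
    """Off-policy: bootstrap from the best action available in the next state."""
    target = reward if done else reward + DISCOUNT * np.max(q_table[s_next])
    q_table[s + (a,)] += LEARNING_RATE * (target - q_table[s + (a,)])

def sarsa_update(q_table, s, a, reward, s_next, a_next, done):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = reward if done else reward + DISCOUNT * q_table[s_next + (a_next,)]
    q_table[s + (a,)] += LEARNING_RATE * (target - q_table[s + (a,)])

# Tiny 3 x 3 x 3 x 3 state grid with 2 actions, just to exercise the updates.
q_a = np.zeros((3, 3, 3, 3, 2))
q_a[(1, 1, 1, 1)] = [1.0, 2.0]
q_b = q_a.copy()

s, s_next = (0, 0, 0, 0), (1, 1, 1, 1)
q_learning_update(q_a, s, 0, 1.0, s_next, done=False)       # uses max(1.0, 2.0)
sarsa_update(q_b, s, 0, 1.0, s_next, a_next=0, done=False)  # uses q of action 0
print(q_a[s + (0,)], q_b[s + (0,)])
```

The states `s` and `s_next` are tuples of bin indices, so `s + (a,)` addresses a single entry of the table.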

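In theory, SARSA lambda could be used as well. However, the agent has to update the entire Q-table at every step, and this table is so large that the computation becomes enormous in practice. Interested readers can try it themselves.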
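To make the cost concrete, here is a minimal sketch of one SARSA lambda step with an accumulating eligibility trace. It is my own illustration under assumed hyperparameters (`lr`, `discount`, `lam`) and an assumed function name, run on a tiny table rather than the real one.

```python
import numpy as np

def sarsa_lambda_step(q_table, e_table, s, a, reward, s_next, a_next,
                      lr=0.1, discount=0.95, lam=0.9):
    """Apply one SARSA(lambda) update in place and return the TD error."""
    td_error = (reward + discount * q_table[s_next + (a_next,)]
                - q_table[s + (a,)])
    e_table[s + (a,)] += 1.0            # mark the pair that was just visited
    q_table += lr * td_error * e_table  # touches every entry of the Q-table
    e_table *= discount * lam           # decays every entry of the trace
    return td_error

# Tiny 2 x 2 x 2 x 2 state grid with 2 actions.
q = np.zeros((2, 2, 2, 2, 2))
e = np.zeros_like(q)

sarsa_lambda_step(q, e, (0, 0, 0, 0), 1, 1.0, (1, 0, 0, 0), 0)
sarsa_lambda_step(q, e, (1, 0, 0, 0), 0, 1.0, (1, 1, 0, 0), 1)

# The pair visited first still receives credit on the second step.
print(q[(0, 0, 0, 0, 1)], q[(1, 0, 0, 0, 0)])
```

The two whole-array lines are what hurts: with roughly one billion entries, each step sweeps both the Q-table and a trace table of the same size, which also doubles the memory needed.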

------


Related posts

[Reinforcement Learning: Climbing the Hill in MountainCar with SARSA and SARSA lambda](https://steemit.com/cn-stem/@hongtao/sarsa-sarsa-lambda-mountaincar)

[Reinforcement Learning: Climbing the Hill in MountainCar with Q-Learning](https://steemit.com/cn-stem/@hongtao/q-learning-mountaincar)

[Reinforcement Learning: Playing Blackjack with MC (Monte Carlo)](https://steemit.com/cn-stem/@hongtao/mc-21)

[Reinforcement Learning in Practice: Solving for the Optimal MDP with Dynamic Programming (DP)](https://steemit.com/cn-stem/@hongtao/dp-mdp)

[Reinforcement Learning: A Taxonomy of Reinforcement Learning Algorithms](https://steemit.com/ai/@hongtao/7atbof)

[Reinforcement Learning: Revisiting the Core Concepts of Reinforcement Learning](https://steemit.com/ai/@hongtao/2bqdkd)

[AI Study Notes: The Sarsa Algorithm](https://steemit.com/ai/@hongtao/ai-sarsa)

[AI Study Notes: Q Learning](https://steemit.com/ai/@hongtao/ai-q-learning)

[AI Study Notes: Solving MDPs with Dynamic Programming (1)](https://steemit.com/ai/@hongtao/ai-dynamic-programming-mdp-1)

[AI Study Notes: Solving MDPs with Dynamic Programming (2)](https://steemit.com/ai/@hongtao/ai-dynamic-programming-mdp-2)

[AI Study Notes: An Introduction to MDPs (Markov Decision Processes)](https://steemit.com/ai/@hongtao/ai-mdp-markov-decision-processes)

[AI Study Notes: Solving for the Optimal MDP](https://steemit.com/ai/@hongtao/ai-mdp)

------

Also posted on my Jianshu: https://www.jianshu.com/u/bd506afc6fc1