Call Number | Q325.6 .S888 2018
Edition | 2nd ed.
Physical Description | 1 online resource (591 pages)
Series Title | Adaptive Computation and Machine Learning Ser.
Language | English
Contents | Intro -- Series Page -- Title Page -- Copyright -- Dedication -- Table of Contents -- Preface to the Second Edition -- Preface to the First Edition -- Summary of Notation -- 1. Introduction -- 1.1. Reinforcement Learning -- 1.2. Examples -- 1.3. Elements of Reinforcement Learning -- 1.4. Limitations and Scope -- 1.5. An Extended Example: Tic-Tac-Toe -- 1.6. Summary -- 1.7. Early History of Reinforcement Learning -- I: Tabular Solution Methods -- 2. Multi-armed Bandits -- 2.1. A k-armed Bandit Problem -- 2.2. Action-value Methods -- 2.3. The 10-armed Testbed -- 2.4. Incremental Implementation -- 2.5. Tracking a Nonstationary Problem -- 2.6. Optimistic Initial Values -- 2.7. Upper-Confidence-Bound Action Selection -- 2.8. Gradient Bandit Algorithms -- 2.9. Associative Search (Contextual Bandits) -- 2.10. Summary -- 3. Finite Markov Decision Processes -- 3.1. The Agent-Environment Interface -- 3.2. Goals and Rewards -- 3.3. Returns and Episodes -- 3.4. Unified Notation for Episodic and Continuing Tasks -- 3.5. Policies and Value Functions -- 3.6. Optimal Policies and Optimal Value Functions -- 3.7. Optimality and Approximation -- 3.8. Summary -- 4. Dynamic Programming -- 4.1. Policy Evaluation (Prediction) -- 4.2. Policy Improvement -- 4.3. Policy Iteration -- 4.4. Value Iteration -- 4.5. Asynchronous Dynamic Programming -- 4.6. Generalized Policy Iteration -- 4.7. Efficiency of Dynamic Programming -- 4.8. Summary -- 5. Monte Carlo Methods -- 5.1. Monte Carlo Prediction -- 5.2. Monte Carlo Estimation of Action Values -- 5.3. Monte Carlo Control -- 5.4. Monte Carlo Control without Exploring Starts -- 5.5. Off-policy Prediction via Importance Sampling -- 5.6. Incremental Implementation -- 5.7. Off-policy Monte Carlo Control -- 5.8. *Discounting-aware Importance Sampling -- 5.9. *Per-decision Importance Sampling -- 5.10. Summary.
6. Temporal-Difference Learning -- 6.1. TD Prediction -- 6.2. Advantages of TD Prediction Methods -- 6.3. Optimality of TD(0) -- 6.4. Sarsa: On-policy TD Control -- 6.5. Q-learning: Off-policy TD Control -- 6.6. Expected Sarsa -- 6.7. Maximization Bias and Double Learning -- 6.8. Games, Afterstates, and Other Special Cases -- 6.9. Summary -- 7. n-step Bootstrapping -- 7.1. n-step TD Prediction -- 7.2. n-step Sarsa -- 7.3. n-step Off-policy Learning -- 7.4. *Per-decision Methods with Control Variates -- 7.5. Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm -- 7.6. *A Unifying Algorithm: n-step Q(σ) -- 7.7. Summary -- 8. Planning and Learning with Tabular Methods -- 8.1. Models and Planning -- 8.2. Dyna: Integrated Planning, Acting, and Learning -- 8.3. When the Model Is Wrong -- 8.4. Prioritized Sweeping -- 8.5. Expected vs. Sample Updates -- 8.6. Trajectory Sampling -- 8.7. Real-time Dynamic Programming -- 8.8. Planning at Decision Time -- 8.9. Heuristic Search -- 8.10. Rollout Algorithms -- 8.11. Monte Carlo Tree Search -- 8.12. Summary of the Chapter -- 8.13. Summary of Part I: Dimensions -- II: Approximate Solution Methods -- 9. On-policy Prediction with Approximation -- 9.1. Value-function Approximation -- 9.2. The Prediction Objective (VE) -- 9.3. Stochastic-gradient and Semi-gradient Methods -- 9.4. Linear Methods -- 9.5. Feature Construction for Linear Methods -- 9.5.1. Polynomials -- 9.5.2. Fourier Basis -- 9.5.3. Coarse Coding -- 9.5.4. Tile Coding -- 9.5.5. Radial Basis Functions -- 9.6. Selecting Step-Size Parameters Manually -- 9.7. Nonlinear Function Approximation: Artificial Neural Networks -- 9.8. Least-Squares TD -- 9.9. Memory-based Function Approximation -- 9.10. Kernel-based Function Approximation -- 9.11. Looking Deeper at On-policy Learning: Interest and Emphasis -- 9.12. Summary.
10. On-policy Control with Approximation -- 10.1. Episodic Semi-gradient Control -- 10.2. Semi-gradient n-step Sarsa -- 10.3. Average Reward: A New Problem Setting for Continuing Tasks -- 10.4. Deprecating the Discounted Setting -- 10.5. Differential Semi-gradient n-step Sarsa -- 10.6. Summary -- 11. *Off-policy Methods with Approximation -- 11.1. Semi-gradient Methods -- 11.2. Examples of Off-policy Divergence -- 11.3. The Deadly Triad -- 11.4. Linear Value-function Geometry -- 11.5. Gradient Descent in the Bellman Error -- 11.6. The Bellman Error is Not Learnable -- 11.7. Gradient-TD Methods -- 11.8. Emphatic-TD Methods -- 11.9. Reducing Variance -- 11.10. Summary -- 12. Eligibility Traces -- 12.1. The λ-return -- 12.2. TD(λ) -- 12.3. n-step Truncated λ-return Methods -- 12.4. Redoing Updates: Online λ-return Algorithm -- 12.5. True Online TD(λ) -- 12.6. *Dutch Traces in Monte Carlo Learning -- 12.7. Sarsa(λ) -- 12.8. Variable λ and γ -- 12.9. Off-policy Traces with Control Variates -- 12.10. Watkins's Q(λ) to Tree-Backup(λ) -- 12.11. Stable Off-policy Methods with Traces -- 12.12. Implementation Issues -- 12.13. Conclusions -- 13. Policy Gradient Methods -- 13.1. Policy Approximation and its Advantages -- 13.2. The Policy Gradient Theorem -- 13.3. REINFORCE: Monte Carlo Policy Gradient -- 13.4. REINFORCE with Baseline -- 13.5. Actor-Critic Methods -- 13.6. Policy Gradient for Continuing Problems -- 13.7. Policy Parameterization for Continuous Actions -- 13.8. Summary -- III: Looking Deeper -- 14. Psychology -- 14.1. Prediction and Control -- 14.2. Classical Conditioning -- 14.2.1. Blocking and Higher-order Conditioning -- 14.2.2. The Rescorla-Wagner Model -- 14.2.3. The TD Model -- 14.2.4. TD Model Simulations -- 14.3. Instrumental Conditioning -- 14.4. Delayed Reinforcement -- 14.5. Cognitive Maps -- 14.6. Habitual and Goal-directed Behavior.
14.7. Summary -- 15. Neuroscience -- 15.1. Neuroscience Basics -- 15.2. Reward Signals, Reinforcement Signals, Values, and Prediction Errors -- 15.3. The Reward Prediction Error Hypothesis -- 15.4. Dopamine -- 15.5. Experimental Support for the Reward Prediction Error Hypothesis -- 15.6. TD Error/Dopamine Correspondence -- 15.7. Neural Actor-Critic -- 15.8. Actor and Critic Learning Rules -- 15.9. Hedonistic Neurons -- 15.10. Collective Reinforcement Learning -- 15.11. Model-based Methods in the Brain -- 15.12. Addiction -- 15.13. Summary -- 16. Applications and Case Studies -- 16.1. TD-Gammon -- 16.2. Samuel's Checkers Player -- 16.3. Watson's Daily-Double Wagering -- 16.4. Optimizing Memory Control -- 16.5. Human-level Video Game Play -- 16.6. Mastering the Game of Go -- 16.6.1. AlphaGo -- 16.6.2. AlphaGo Zero -- 16.7. Personalized Web Services -- 16.8. Thermal Soaring -- 17. Frontiers -- 17.1. General Value Functions and Auxiliary Tasks -- 17.2. Temporal Abstraction via Options -- 17.3. Observations and State -- 17.4. Designing Reward Signals -- 17.5. Remaining Issues -- 17.6. Reinforcement Learning and the Future of Artificial Intelligence -- References -- Index.
Subject | Reinforcement learning.
Supplement/Special Issue Entry | Print version: Sutton, Richard S. Reinforcement Learning, Second Edition. Cambridge : MIT Press, c2018. 9780262039246
ISBN | 9780262352703, 9780262039246