Top > Search of International Patents > Solution search for combinatorial bandit problem

Solution search for combinatorial bandit problem

Foreign code F190009916
File No. 08135-US
Posted date Aug 26, 2019
Country United States of America
Application number 201414780166
Gazette No. 20160042291
Gazette No. 10452988
Date of filing Mar 17, 2014
Gazette Date Feb 11, 2016
Gazette Date Oct 22, 2019
International application number JP2014001506
International publication number WO2014156044
Date of international filing Mar 17, 2014
Date of international publication Oct 2, 2014
Priority data
  • P2013-066768 (Mar 27, 2013) JP
  • 2014JP01506 (Mar 17, 2014) WO
Title Solution search for combinatorial bandit problem
Abstract A solution search system, which can search for a test object, expected to output an optimal result, from at least two test objects each outputting a result based on a probability distribution, includes: a record superiority comparing unit which can obtain past records of the test objects based on an accumulation of the output results and can compare all records for superiority/inferiority; a controlling unit which can control a measurement variable of the test objects, based on the compared superiority/inferiority records, and a latest result output from the test object; and an output instructing unit which can instruct the test object, the measurement variable of which has exceeded a threshold value, to output a result, wherein the output instructing unit determines, as a desired solution, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions.
Outline of related art and contending technology BACKGROUND ART
Conventionally, a bandit problem has been known as a representative example of a problem of searching for a solution capable of maximizing an expected value. A purpose of the bandit problem is to maximize an expected value of the total of rewards one can receive. In the bandit problem, a player repeats choosing, by a certain action, one option from n-types of different options of action. After each selection, a result selected from a probability distribution which depends on the selected action is given to the player as a reward.
Take the following case as an example: There are several slot machines, and a player can get a coin (a reward) by pulling a lever of the machine, under a certain probability distribution. The probability distribution (a winning rate) of getting a coin differs among each of the slot machines, and the player has no knowledge about the winning rate. In such case, a most common method for evaluating the winning rate of each of the slot machines is simply to play each of the slot machines multiple times one after another. The slot machine which actually provides the highest reward is determined to be the machine of the highest winning rate.
In the above method, however, the player has to play the slot machines for a considerable number of times in order to determine a machine with the highest winning rate in fact. This, as a result, requires a large investment. It is necessary, therefore, to create an algorithm capable of finding a solution efficiently, as well as reducing an investment as much as possible, in searching for the winning rate of each of the slot machines.
To find a solution, the above case can be applied to the bandit problem described above, which is a problem of searching for a solution capable of maximizing an expected value of the total of rewards (for example, refer to Non Patent Literature 1). In particular, a combinatorial bandit problem has recently been drawing attention. In the combinatorial bandit problem, a combination of options which is expected to output an optimal result is selected from n-types of different options of action. There is a need for the combinatorial bandit problem in various fields, other than selecting a combination of slot machines which is likely to provide a high dividend from among a plurality of slot machines. For example, the combinatorial bandit problem can be used in selecting an optimal channel combination which is able to maximize the amount of data transmission in cognitive radio communication, an optimal advertisement combination to maximize the click count in Internet advertising, and a portfolio of financial instruments with the highest return on investment. In these applications, the bandit problem is a commoner type of a combinatorial reward maximization problem. In other words, there are multiple players, and the amount of reward for each player is determined depending on the choice by the player (for example, by a payoff matrix). In the present description, however, a combinatorial reward maximization problem in which each slot machine is independent of each other (particularly, a combination of two slot machines) is described with examples for simplification.
Scope of claims [claim1]
1. A solution search system configured to search for a test object expected to output an optimal result, from among at least two test objects each of which outputs a result based on a probability distribution, the solution search system comprising:
a hardware processor or circuitry configured to:
obtain a past record for each of the test objects, based on an accumulation of the results having been output, and compare the records of the test objects in terms of superiority and inferiority in relation to the records of all the test objects;
perform control to increase or decrease a measurement variable for each of the test objects, based on the superiority/inferiority of the record having been compared, and a latest result having been output from the test object;
instruct the test object, the measurement variable of which has exceeded a threshold value, to output a result; and
determine, as a desired solution, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions.

[claim2]
2. The solution search system according to claim 1, the solution search system being further configured to search for a combination of test objects which is expected to output an optimal result, from among at least three test objects each of which outputs a result based on a predetermined probability distribution,
wherein the hardware processor or circuitry is configured to determine, as a desired solution, a combination of test objects to which the largest number of the output instructions have been given after repetition of the output instructions.

[claim3]
3. The solution search system according to claim 1, the solution search system being further configured to search for a combination of test objects, each of the test objects having a probability distribution which varies chronologically in response to an output instruction,
wherein when searching for the solution, at least two solution-search systems are used.

[claim4]
4. The solution search system according to claim 1, wherein hardware processor or circuitry is configured to:
raise the past record of one test object to a superior level, when the test object has output a superior result, while the other test objects have output inferior results; and
lower the past record of one test object to an inferior level, when the test object has output an inferior result, while the other test objects have output superior results.

[claim5]
5. The solution search system according to claim 1, wherein the hardware processor or circuitry, which utilizes an internal resource value which is a difference between the record of one test object and the mean of the records of all the test objects, is configured to:
increase the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is superior;
maintain the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is inferior;
increase the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is superior;
decrease the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is inferior;
maintain the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is superior; and
decrease the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is inferior.

[claim6]
6. The solution search system according to claim 1, wherein the hardware processor or circuitry is configured to perform control so that the sum of the measurement variables which are assigned to the respective test objects can be kept constant.

[claim7]
7. A non-transitory computer-readable medium storing a solution search program thereon, the solution search program being executable by a hardware processor to control the hardware processor to search for a test object expected to output an optimal result, from among at least two test objects each of which outputs a result based on a probability distribution, by controlling the hardware processor to execute processes comprising:
record superiority comparing, in which a past record is obtained for each of the test objects, based on an accumulation of the results having been output, and the records of the test objects are compared in terms of superiority and inferiority in relation to the records of all the test objects;
controlling, in which control is performed to increase or decrease a measurement variable for each of the test objects, based on the superiority/inferiority of the record having been compared by the record superiority comparing, and a latest result having been output from the test object; and
output instructing, in which the test object, the measurement variable of which has exceeded a threshold value, is instructed to output a result,
wherein in the output instructing, the computer is caused to determine, as a desired solution, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions.

[claim8]
8. The non-transitory computer-readable medium according to claim 7, wherein the record superiority comparing comprises:
raising the past record of one test object to a superior level, when the test object has output a superior result, while the other test objects have output inferior results; and
lowering the past record of one test object to an inferior level, when the test object has output an inferior result, while the other test objects have output superior results.

[claim9]
9. The non-transitory computer-readable medium according to claim 7, wherein:
the record superiority comparing utilizes an internal resource value which is a difference between the record of one test object and the mean of the records of all the test objects, and
the controlling comprises:
increasing the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is superior;
maintaining the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is inferior;
increasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is superior;
decreasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is inferior;
maintaining the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is superior; and
decreasing the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is inferior.

[claim10]
10. The non-transitory computer-readable medium according to claim 7, wherein in the controlling, control is performed so that the sum of the measurement variables which are assigned to the respective test objects can be kept constant.

[claim11]
11. A solution search method for searching for a test object expected to output an optimal result, from among at least two test objects each of which outputs a result based on a probability distribution, the solution search method being executed by a hardware processor or circuitry, and the solution search method comprising:
record superiority comparing, in which a past record is obtained for each of the test objects, based on an accumulation of the results having been output, and the records of the test objects are compared in terms of superiority and inferiority in relation to the records of all the test objects;
controlling, in which control is performed to increase or decrease a measurement variable for each of the test objects, based on the superiority/inferiority of the record having been compared by the record superiority comparing, and a latest result having been output from the test object; and
output instructing, in which the test object, the measurement variable of which has exceeded a threshold value, is instructed to output a result,
wherein in the output instructing, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions of the result is determined as a desired solution.

[claim12]
12. The solution search method according to claim 11, wherein the record superiority comparing comprises:
raising the past record of one test object to a superior level, when the test object has output a superior result, while the other test objects have output inferior results; and
lowering the past record of one test object to an inferior level, when the test object has output an inferior result, while the other test objects have output superior results.

[claim13]
13. The solution search method according to claim 11, wherein:
the step of record superiority comparing utilizes an internal resource value which is a difference between the record of one test object and the mean of the records of all the test objects, and
the controlling comprises:
increasing the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is superior;
maintaining the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is inferior;
increasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is superior;
decreasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is inferior;
maintaining the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is superior; and
decreasing the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is inferior.

[claim14]
14. The solution search method according to claim 11, wherein in the controlling, control is performed so that the sum of the measurement variables which are assigned to the respective test objects can be kept constant.
  • Inventor, and Inventor/Applicant
  • KIM Song-Ju
  • AONO Masashi
  • NAMEDA Etsushi
  • HARA Masahiko
  • RIKEN
IPC(International Patent Classification)

PAGE TOP

close
close
close
close
close
close