TOP > 外国特許検索 > Solution search for combinatorial bandit problem

# Solution search for combinatorial bandit problem

外国特許コード F190009916 08135-US 2019年8月26日 アメリカ合衆国 201414780166 20160042291 10452988 平成26年3月17日(2014.3.17) 平成28年2月11日(2016.2.11) 令和元年10月22日(2019.10.22) JP2014001506 WO2014156044 平成26年3月17日(2014.3.17) 平成26年10月2日(2014.10.2) 特願2013-066768 (2013.3.27) JP 2014JP01506 (2014.3.17) WO Solution search for combinatorial bandit problem A solution search system, which can search for a test object, expected to output an optimal result, from at least two test objects each outputting a result based on a probability distribution, includes: a record superiority comparing unit which can obtain past records of the test objects based on an accumulation of the output results and can compare all records for superiority/inferiority; a controlling unit which can control a measurement variable of the test objects, based on the compared superiority/inferiority records, and a latest result output from the test object; and an output instructing unit which can instruct the test object, the measurement variable of which has exceeded a threshold value, to output a result, wherein the output instructing unit determines, as a desired solution, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions. BACKGROUND ARTConventionally, a bandit problem has been known as a representative example of a problem of searching for a solution capable of maximizing an expected value. A purpose of the bandit problem is to maximize an expected value of the total of rewards one can receive. In the bandit problem, a player repeats choosing, by a certain action, one option from n-types of different options of action. After each selection, a result selected from a probability distribution which depends on the selected action is given to the player as a reward.Take the following case as an example: There are several slot machines, and a player can get a coin (a reward) by pulling a lever of the machine, under a certain probability distribution. The probability distribution (a winning rate) of getting a coin differs among each of the slot machines, and the player has no knowledge about the winning rate. In such case, a most common method for evaluating the winning rate of each of the slot machines is simply to play each of the slot machines multiple times one after another. The slot machine which actually provides the highest reward is determined to be the machine of the highest winning rate.In the above method, however, the player has to play the slot machines for a considerable number of times in order to determine a machine with the highest winning rate in fact. This, as a result, requires a large investment. It is necessary, therefore, to create an algorithm capable of finding a solution efficiently, as well as reducing an investment as much as possible, in searching for the winning rate of each of the slot machines.To find a solution, the above case can be applied to the bandit problem described above, which is a problem of searching for a solution capable of maximizing an expected value of the total of rewards (for example, refer to Non Patent Literature 1). In particular, a combinatorial bandit problem has recently been drawing attention. In the combinatorial bandit problem, a combination of options which is expected to output an optimal result is selected from n-types of different options of action. There is a need for the combinatorial bandit problem in various fields, other than selecting a combination of slot machines which is likely to provide a high dividend from among a plurality of slot machines. For example, the combinatorial bandit problem can be used in selecting an optimal channel combination which is able to maximize the amount of data transmission in cognitive radio communication, an optimal advertisement combination to maximize the click count in Internet advertising, and a portfolio of financial instruments with the highest return on investment. In these applications, the bandit problem is a commoner type of a combinatorial reward maximization problem. In other words, there are multiple players, and the amount of reward for each player is determined depending on the choice by the player (for example, by a payoff matrix). In the present description, however, a combinatorial reward maximization problem in which each slot machine is independent of each other (particularly, a combination of two slot machines) is described with examples for simplification. [claim1]1. A solution search system configured to search for a test object expected to output an optimal result, from among at least two test objects each of which outputs a result based on a probability distribution, the solution search system comprising: a hardware processor or circuitry configured to: obtain a past record for each of the test objects, based on an accumulation of the results having been output, and compare the records of the test objects in terms of superiority and inferiority in relation to the records of all the test objects; perform control to increase or decrease a measurement variable for each of the test objects, based on the superiority/inferiority of the record having been compared, and a latest result having been output from the test object; instruct the test object, the measurement variable of which has exceeded a threshold value, to output a result; and determine, as a desired solution, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions.[claim2]2. The solution search system according to claim 1, the solution search system being further configured to search for a combination of test objects which is expected to output an optimal result, from among at least three test objects each of which outputs a result based on a predetermined probability distribution, wherein the hardware processor or circuitry is configured to determine, as a desired solution, a combination of test objects to which the largest number of the output instructions have been given after repetition of the output instructions.[claim3]3. The solution search system according to claim 1, the solution search system being further configured to search for a combination of test objects, each of the test objects having a probability distribution which varies chronologically in response to an output instruction, wherein when searching for the solution, at least two solution-search systems are used.[claim4]4. The solution search system according to claim 1, wherein hardware processor or circuitry is configured to: raise the past record of one test object to a superior level, when the test object has output a superior result, while the other test objects have output inferior results; and lower the past record of one test object to an inferior level, when the test object has output an inferior result, while the other test objects have output superior results.[claim5]5. The solution search system according to claim 1, wherein the hardware processor or circuitry, which utilizes an internal resource value which is a difference between the record of one test object and the mean of the records of all the test objects, is configured to: increase the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is superior; maintain the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is inferior; increase the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is superior; decrease the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is inferior; maintain the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is superior; and decrease the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is inferior.[claim6]6. The solution search system according to claim 1, wherein the hardware processor or circuitry is configured to perform control so that the sum of the measurement variables which are assigned to the respective test objects can be kept constant.[claim7]7. A non-transitory computer-readable medium storing a solution search program thereon, the solution search program being executable by a hardware processor to control the hardware processor to search for a test object expected to output an optimal result, from among at least two test objects each of which outputs a result based on a probability distribution, by controlling the hardware processor to execute processes comprising: record superiority comparing, in which a past record is obtained for each of the test objects, based on an accumulation of the results having been output, and the records of the test objects are compared in terms of superiority and inferiority in relation to the records of all the test objects; controlling, in which control is performed to increase or decrease a measurement variable for each of the test objects, based on the superiority/inferiority of the record having been compared by the record superiority comparing, and a latest result having been output from the test object; and output instructing, in which the test object, the measurement variable of which has exceeded a threshold value, is instructed to output a result, wherein in the output instructing, the computer is caused to determine, as a desired solution, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions.[claim8]8. The non-transitory computer-readable medium according to claim 7, wherein the record superiority comparing comprises: raising the past record of one test object to a superior level, when the test object has output a superior result, while the other test objects have output inferior results; and lowering the past record of one test object to an inferior level, when the test object has output an inferior result, while the other test objects have output superior results.[claim9]9. The non-transitory computer-readable medium according to claim 7, wherein: the record superiority comparing utilizes an internal resource value which is a difference between the record of one test object and the mean of the records of all the test objects, and the controlling comprises: increasing the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is superior; maintaining the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is inferior; increasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is superior; decreasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is inferior; maintaining the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is superior; and decreasing the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is inferior.[claim10]10. The non-transitory computer-readable medium according to claim 7, wherein in the controlling, control is performed so that the sum of the measurement variables which are assigned to the respective test objects can be kept constant.[claim11]11. A solution search method for searching for a test object expected to output an optimal result, from among at least two test objects each of which outputs a result based on a probability distribution, the solution search method being executed by a hardware processor or circuitry, and the solution search method comprising: record superiority comparing, in which a past record is obtained for each of the test objects, based on an accumulation of the results having been output, and the records of the test objects are compared in terms of superiority and inferiority in relation to the records of all the test objects; controlling, in which control is performed to increase or decrease a measurement variable for each of the test objects, based on the superiority/inferiority of the record having been compared by the record superiority comparing, and a latest result having been output from the test object; and output instructing, in which the test object, the measurement variable of which has exceeded a threshold value, is instructed to output a result, wherein in the output instructing, at least one test object to which the largest number of the output instructions have finally been given after repetition of the output instructions of the result is determined as a desired solution.[claim12]12. The solution search method according to claim 11, wherein the record superiority comparing comprises: raising the past record of one test object to a superior level, when the test object has output a superior result, while the other test objects have output inferior results; and lowering the past record of one test object to an inferior level, when the test object has output an inferior result, while the other test objects have output superior results.[claim13]13. The solution search method according to claim 11, wherein: the step of record superiority comparing utilizes an internal resource value which is a difference between the record of one test object and the mean of the records of all the test objects, and the controlling comprises: increasing the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is superior; maintaining the measurement variable, when the internal resource value is positive, and the latest result having been output from the test object is inferior; increasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is superior; decreasing the measurement variable, when the internal resource value is zero, and the latest result having been output from the test object is inferior; maintaining the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is superior; and decreasing the measurement variable, when the internal resource value is negative, and the latest result having been output from the test object is inferior.[claim14]14. The solution search method according to claim 11, wherein in the controlling, control is performed so that the sum of the measurement variables which are assigned to the respective test objects can be kept constant. KIM Song-Ju AONO Masashi NAMEDA Etsushi HARA Masahiko RIKEN G06N   7/00      特定の数学的モデルに基づいたコンピュータシステム G06Q  10/04      予測あるいは最適化，例．線形計画法，巡回セールスマン問題，板取問題