Fascinating, as always Julian! Keep up the good work.
Thanks Ben :)
you:
The first striking thing for me is that the graphs for Leela and Maia mostly agree, which is kind of amazing since they work quite differently
me: Well, I wonder whether there isn't some high-level information recirculation here. It might depend on how Maia was trained. As I understand it, the principle is a model with a perfect-play benchmark plus an error model depending on a binned categorical variable, with few non-board hyperparameters, and maybe some dependence on position information (???). It might be that, in the calibration converting SF evaluations to odds using Lichess data, their odds function is defined in a way that would make your observation follow.
While the chess play itself, or the NN training, uses the same Lc0 input encoding, it has a different purpose (the error relative to SF, I presume, or some other perfect-play benchmark definition). Somewhere in there, I suspect, we should check the flow of information in the high-level statistics about odds; I am not sure those are independent. I don't even understand what is being trained: fitting the Lichess outcomes and the error model at each position, yielding the error-model fit and its parameter estimates?
If I am readable: do you yourself have an understanding of what the machine learning setup was, i.e. the design of the training data matrix? We have Lichess games, all their positions, the players' ratings, and the game outcomes. We also have, for each position within that set of games (EPDs deciding whether two positions are the same or not; see my other comment on the definition of position difficulty), the SF benchmark modulo an added error part (whatever function of error and benchmark the model assumptions dictate) that would fit the outcome. Is that right? You don't have to answer if this makes no sense, and don't sweat it, but if you share my questions... please help.
I'm not 100% sure but I think that Maia was only trained from the Lichess games without any additional information from other engines.
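For what it's worth, that matches my reading of the paper: the training signal is just the move a human played, bucketed by rating, with no engine evaluation anywhere in the targets. A minimal sketch of what such a training-data layout could look like (all names and the bin boundaries here are illustrative, not taken from the Maia codebase):

```python
# Hedged sketch of a Maia-style supervised setup: each example is
# (position, human move), grouped per rating bin, with NO Stockfish
# labels in the training signal. Names and bin edges are illustrative.

def rating_bin(rating, lo=1100, hi=1900, width=100):
    """Bucket a Lichess rating into bins (1100, 1200, ..., 1900),
    clamping ratings outside the range to the nearest bin."""
    return max(lo, min(hi, (rating // width) * width))

def build_examples(games):
    """Turn games into per-bin (position, played_move) examples.

    `games` is a list of dicts with 'white_elo', 'black_elo', and
    'moves' as a list of (fen_before_move, move_played) pairs.
    The target is the move a human actually played.
    """
    examples = {}
    for g in games:
        for ply, (fen, move) in enumerate(g["moves"]):
            # even plies are White's moves, odd plies are Black's
            elo = g["white_elo"] if ply % 2 == 0 else g["black_elo"]
            examples.setdefault(rating_bin(elo), []).append((fen, move))
    return examples

# toy usage: one game between a 1475 and an 1832 player
games = [{
    "white_elo": 1475, "black_elo": 1832,
    "moves": [("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -", "e2e4"),
              ("rnbqkbnr/pppppppp/8/4P3/8/8/PPPP1PPP/RNBQKBNR b KQkq -", "c7c5")],
}]
per_bin = build_examples(games)
```

The point of the sketch is only the shape of the data matrix: positions and human moves in, one model per rating bin, and nothing from SF.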
My difficulty with using the word "contempt" might be that it already means something specific in UCI and SF-type engines, and I am not sure it is the right term here. It is as if one side knew that the other is not playing as in self-play, in the SF-engine sense of contempt. I may not understand it at all.
But for Lc0, and from reading your thinking (sorry to take so much space in comparison), it would point in a direction away from the best-play continuation from the root position in question.
So it basically means they do want to include more information about the rating differential, or even the rating pair itself (the actual information might be more than the difference or ratio; it could be the pair of variables, which would be the full probability model, if we are in the world of rating systems). But I get that we might not have the statistical "power" to start there (although I thought I saw something in that figure 11, beyond the binning, just missing the hidden dispersion).
I may also lack the chess understanding of why not to use all the information at hand...
Ok, I understand your trajectory. Self-play, as in human stratification, or band pairing, for both optimal learning and game enjoyment, and tournament-tiering efficiency (my words; this has probably evolved over the history of tournament formats).
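To make the "pair versus difference" point concrete: the classical Elo expected score is a logistic function of the rating difference alone, so any information the pair carries beyond its difference is discarded by construction. A one-liner shows this:

```python
# The classical Elo expected score depends only on the rating
# *difference*. A "full probability model" in the sense above would
# instead condition on the pair (r_a, r_b), e.g. a 2-D table or
# regression over both ratings.

def elo_expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Any two pairings with the same difference give the same prediction,
# even though a 1500-vs-1300 game may behave differently in practice
# than a 2500-vs-2300 game:
assert elo_expected_score(1500, 1300) == elo_expected_score(2500, 2300)
```

That collapse of the pair onto its difference is exactly what one would hope to see beyond, if the binned data had the power for it.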
Aligning Superhuman AI with Human Behavior: Chess as a Model System
https://arxiv.org/abs/2006.01855
7.2 Centipawn to win probability
https://www.semanticscholar.org/reader/5a32e6268aa5eaab368c8cdcbb6a571da5e42c28#
Figure 11 is in the supplemental material of the last version. The arXiv PDF has it, but the link above is readable on the web.
I agree that, in order to control some floating variables, or confounding factors, besides best play there is self-play. It could be a more adaptive angle on the foresight problem of chess evaluation of candidate moves, from the point of view of players at any level. When we churn through plans before a decision and consider the opponent's responses, we only have our own internal model of chess to use, with its skill levels, which for now we can only average into ratings.
you:
While thinking about this, I asked myself how well engines might predict the expected score between humans. Using Stockfish wouldn't work since it's way too strong. Specifically, the expected score should be adapted to different rating ranges. There are two engine approaches that came to mind for me, namely Maia and the WDL contempt feature of LC0.
me: I have had some previous thoughts in that direction, and am curious about your point of view; see below. I am also trying to improve readability by pushing some tangential questions to the end... I hope that works (when I can catch myself...).
Now my thoughts:
You are aware that the first Maia scientific article (or a preprint) was actually asking a similar question: basically, how an SF score differential** (my words) on a position in a game database (at what depth? so many questions...) relates, or converts, into odds, while pooling rating-band pairings into some number of bins.
I am fascinated by the information one can see with blurry vision beyond the binning. But I could not read the details of the statistical moments behind the smooth, well-staggered conversion curves across the bins. As if there was information to be had in finer-grained binning (even slope information) that is lost in binning (not only the pair averaging, but the function's dependency on the average as a possibly finer-grained ordered variable). See figure 11 of some version of their preprint, or even the final version.
And I am aware that Lc0, and even Lichess before that (from the "learn from your mistakes" human-chess concern), and now even SF, are transitioning to, or already have, an internal alternative scale in terms of odds (though the simple eval is still material-based). And yes, all of those are about high-level play, but not the Maia preliminary data analysis*** (figure 11, for example). I just did not do the thorough Feynman job; I only had a question mark pass by as a distraction, about what exactly their definition of position difficulty (using SF) was (e.g., was it an average over all games visiting the position? Was a position a FEN, or an EPD without opcodes, i.e. no depth, but all EPD fields required, which is my preference for the human thinking task?).
Now I have forgotten my other points. Maybe this is already too much. If you are aware of that figure, what are your thoughts?
** Or position difficulty (I would be curious to hear your comment on how their definition of position difficulty using SF relates to the objects you have been exploring as similar conceptual measures of a position, such as sharpness; that is a tangent in passing, sorry).
*** I also lost track of where they wrote a lot more of that pre-Maia engine-dev, Lichess-data battery of statistical characterization. I was actually more curious about that, but the engine toy seems to be the more popular direction. I thought it was in the first paper; I am bad with addresses of content...
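For concreteness, the kind of centipawn-to-odds conversion figure 11 shows (empirical win rate versus SF evaluation, pooled per rating bin) is usually summarized with a logistic curve whose steepness varies by rating. The sketch below uses an illustrative logistic form with made-up per-bin scale parameters; it is not the paper's actual fit:

```python
import math

# Hedged sketch: converting a Stockfish centipawn evaluation into an
# expected score for a given rating bin. The logistic shape is the
# standard choice for such conversions; the per-bin scale values are
# purely illustrative, NOT the fit behind the paper's figure 11.

ILLUSTRATIVE_SCALE = {1100: 500.0, 1500: 350.0, 1900: 250.0}  # made up

def win_probability(cp, bin_rating):
    """Expected score for the side to move, from an eval in centipawns.

    A smaller scale means a steeper curve: stronger players convert a
    given advantage more reliably, which is what the staggered curves
    across bins express.
    """
    scale = ILLUSTRATIVE_SCALE[bin_rating]
    return 1.0 / (1.0 + math.exp(-cp / scale))
```

Under these toy parameters a +300cp advantage is worth more, in expected score, to the 1900 bin than to the 1100 bin; the interesting question above is what finer-grained structure (slopes, dispersion) survives between such binned curves.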
Calculating the win probabilities from a big dataset of games is also interesting, but I think that using the SF evaluation could lead to problems where some +3 positions are trivial wins for amateur players while others are much more difficult to convert. I didn't go in that direction since it would take too much computing power to evaluate enough positions for meaningful statistics.