I have not had time to read through all the research challenge posts, so I apologize if someone else has already done this; I am not trying to steal anyone's thunder. I was interested in seeing how much reach matters, so I made a variable called reach difference. I computed Winner's Reach - Loser's Reach for the first 783 fighters, whom I labeled "winners", and then computed Loser's Reach - Winner's Reach and labeled those fighters "losers". That gave me 783 winners and 783 losers, each with the reach advantage they had or gave up to their opponent. Then I coded a loss as 0 and a win as 1 and ran a basic two-tailed Spearman correlation test.
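For anyone who wants to try this outside SPSS, here is a minimal sketch of the setup in plain Python. The fight list is made up for illustration (it is not the real data), the helper names are hypothetical, and the half-and-half split is one reading of the procedure described above, so treat it as an approximation of the method rather than the exact analysis.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical (winner_reach, loser_reach) pairs in inches, NOT the real data.
fights = [(74, 72), (70, 71), (76, 73), (72, 72.5),
          (75, 70), (69, 71), (73, 74), (71, 68)]

half = len(fights) // 2
rows = []
for winner_reach, loser_reach in fights[:half]:
    rows.append((winner_reach - loser_reach, 1))  # winner's view, coded 1
for winner_reach, loser_reach in fights[half:]:
    rows.append((loser_reach - winner_reach, 0))  # loser's view, coded 0

reach_diff = [diff for diff, _ in rows]
outcome = [won for _, won in rows]
rho = spearman_rho(reach_diff, outcome)
print(round(rho, 3))
```

This only gives rho itself; the two-tailed p-value would need a statistics package or a permutation test on top.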
Correlation means that two variables covary. A classic example is height and weight: if one person is taller than another, there is a good chance that he or she is also heavier. There are plenty of exceptions, though, so this is not a perfect correlation. Correlations range from -1 (a perfect negative correlation) through 0 (no correlation whatsoever) to 1 (a perfect positive correlation). If the correlation between two variables is 1, all variation in X can be explained by variation in Y and vice versa. However, one must also look for confounding variables and be very careful before drawing conclusions about causality. A classic example is the almost perfect correlation between ice cream sales and drowning accidents. Does eating ice cream make people drown? Of course not. The confounding variable is good weather, which leads to more ice cream sales and to more drowning accidents, since people swim more. Both variables are related to the good weather, and there is no causal link between the two of them. Another example is the correlation in post-war Britain between the number of radios and the number of people institutionalised in mental hospitals. These were both effects of modernisation and had nothing to do with each other, even though they showed an almost perfect statistical correlation.
A major confound in this study might be that a fighter with more reach is also heavier than his opponent, so that reach doesn't really matter and the factor that makes the difference is weight on fight night.
The correlation test gave Spearman's rho = 0.062, p = .017. This means that there is a significant correlation between reach and outcome, with the fighter with the longer reach being more likely to win. However, a rho of 0.062 is a very small correlation, even though it is significant. To see how much of the variation in win/lose is accounted for by variation in reach, you square the rho value: 0.062 * 0.062 = 0.003844. This means that reach accounts for only 0.38 percent of the variation in whether a fighter wins or loses!
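The squaring step can be checked with a couple of lines of Python. The function name is just a hypothetical helper for illustration, and note that with a rank correlation the squared value strictly describes shared variation in the ranks.

```python
def variance_share_percent(rho):
    """Square a correlation coefficient and express it as a percentage."""
    return rho ** 2 * 100

# rho for all fighters, as reported above
print(f"{variance_share_percent(0.062):.2f} percent")  # prints "0.38 percent"
```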
I also split the file into fighters with long reach and fighters with short reach. The mean reach of the winning fighters was 72.9 inches, so that is where I split the file, putting the fighters with long reach in one group and the fighters with short reach in another. Then I ran the same correlation tests again.
For fights in which the winning fighter had a reach above 72.9 inches, the correlation was Spearman's rho = 0.415, p < .001. This is a much bigger correlation and also highly significant. Here reach accounted for 17.22 percent of the variation in whether a fighter wins or loses!
The same test for the shorter fighters gave Spearman's rho = -.372, p < .001. This means that among shorter fighters the fighter with less reach is significantly more likely to win, with reach accounting for about 13.8 percent of the variation in whether a fighter wins or loses.
Dividing the file like this could explain why we got such a low correlation in the first place when we used all of the fighters since these two effects cancelled each other out.
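To make the cancelling idea concrete, here is a toy sketch in plain Python. The numbers are invented purely for illustration (they are not the fight data) and the helper names are hypothetical: one group has a positive relationship between reach difference and outcome, the other has the mirror-image negative one, and pooling them gives a correlation of zero.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks, with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented toy data: reach difference and outcome (1 = win, 0 = loss).
long_diff, long_outcome = [1, 2, 3, 4], [0, 0, 1, 1]    # more reach, more wins
short_diff, short_outcome = [1, 2, 3, 4], [1, 1, 0, 0]  # more reach, fewer wins

rho_long = spearman_rho(long_diff, long_outcome)
rho_short = spearman_rho(short_diff, short_outcome)
rho_all = spearman_rho(long_diff + short_diff, long_outcome + short_outcome)

print(rho_long > 0, rho_short < 0, abs(rho_all) < 1e-9)
```

Real data will not cancel this neatly, but the same mechanism can pull a pooled correlation toward zero.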
I would love to hear your thoughts on these statistics and also on why you think a reach advantage seems to be a good thing in the higher weight classes but a bad thing in the lower ones? HW knock out machines vs stocky FW wrestlers?
UPDATE: Thank you for your comments! To test if there were some issues with just splitting the file in half, I did the same test as above once again but reversed it so I defined the first 743 fighters in the original file as losers and the last 743 as winners and did the same calculation to see the correlation between Reach difference and outcome (win or loss) and found very similar results as before.
I also did a further breakdown of the data, dividing the winning fighters' reach into four groups (quartiles). The first group was fighters with a reach of 70 or below, the second was fighters with a reach between 70.1 and 73, the third was reaches from 73.1 to 75.5, and the last was fighters with a reach above 75.5.
Then I did the same correlation tests as described above again, giving that:
For fighters with a reach of 70 or below, Spearman's rho between reach difference and outcome was -.442, p < .001.
For the group 70.1 to 73, Spearman's rho was -.185, p < .001.
For the group 73.1 to 75.5, Spearman's rho was .153, p = .01.
For the group 75.6 and above, Spearman's rho was .651, p < .001.
This supports the hypothesis that there is a significant interaction effect between reach and reach advantage: the more reach you have, the more a reach advantage helps you, and vice versa, the less reach you have, the more you benefit from a reach disadvantage. I am pretty sure that fight night weight and size in general are a big factor here, and it would be really interesting to get an estimate of weight on fight night and see whether reach is still a significant factor.
However I still don't really see why a reach disadvantage benefits a fighter in the lower weight classes.
I did a crosstab including fighter's reach divided into the four groups mentioned earlier.
As we see in this table (or are attempting to see, at least, since for some reason I can't get proper picture quality when copying from SPSS), there is a significant asymmetry. That is, which reach group you belong to affects the way you will win. For example, if it were just random, 123.4 of the fights involving the bigger fighters would end in KO/TKO. In this data, though, 164 of those fights did so, meaning that bigger fighters are more likely to end a fight by KO/TKO than chance would predict. In the same way, the smallest fighters have a higher probability of their fights going to decision and a lower probability of their fights ending in KO/TKO.
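For readers wondering where a "just random" figure like 123.4 comes from: in a crosstab the expected count for a cell is usually row total * column total / grand total. The sketch below shows that calculation in plain Python. The counts and the function name are hypothetical, chosen only to keep the arithmetic easy to follow; they are not the numbers from the SPSS table.

```python
# Hypothetical crosstab (rows: reach group, columns: method of winning).
observed = {
    "shortest": {"KO/TKO": 40, "Submission": 30, "Decision": 80},
    "longest": {"KO/TKO": 90, "Submission": 25, "Decision": 35},
}

def expected_counts(table):
    """Expected cell counts if row and column were unrelated."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, count in cols.items():
            col_totals[col] = col_totals.get(col, 0) + count
    grand_total = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / grand_total for col in cols}
        for row, cols in table.items()
    }

expected = expected_counts(observed)
# With these made-up counts: 150 * 130 / 300 = 65 expected KO/TKOs
# for the longest group, against 90 observed.
print(expected["longest"]["KO/TKO"], observed["longest"]["KO/TKO"])
```

A chi-square test then asks whether the gaps between observed and expected counts are larger than chance alone would produce.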
That there is a connection between fighter size and method of winning means that we have multicollinearity in our data, which makes it very hard to know what the causal link between reach and fight outcome is. As we saw in my previous post, the fighter with the longer reach also has a significantly higher chance of winning when a fight ends in KO/TKO. But since bigger fighters also finish by KO/TKO more often, can this be attributed to the reach advantage or to the fact that these fights end in KO/TKO more often? What exactly is it that we have measured here? I am trying to find a smart way to run a loglinear regression analysis involving reach difference, method of winning and fighter reach group, to see whether these variables can predict fight outcome and to see each variable's partial effect on the prediction.
If you just add these variables to a binary logistic regression you get a very good model, capable of successfully estimating the win/lose outcome in 88.6 percent of the fights in this data set. However, it cannot really be used for gambling, since you don't know the method of winning before making your bet (unless the fight involves GSP or Fitch, that is; no hating, I'm a big fan of both fighters). This model also gives a Nagelkerke R Square (variation in outcome explained by the model) of 83.3 %. However, the only significant predictor in this model is reach difference (p = .001); the other two predictors are far from significant (p = .931 and .982). Since we have clearly seen that we have multicollinearity here, it is hard to know how much of the variation in the other two variables is captured by the reach difference variable (we know that there is a correlation between method of winning and reach difference, and we know that reach group affects whether a reach advantage is beneficial or counterproductive). A model with reach difference as the only predictor, for example, still shows reach difference as a significant predictor, but it only guesses the outcome of the fights correctly in 52.0 percent of the cases. This is still significantly better than chance but way worse than the model suggested above.
I would love to hear your thoughts on this!