To continue the discussion of test validity and reliability, here is one final tip to help with your analysis.
Avoid getting hoodwinked
As a reminder, a test should predict, in a statistically significant way, performance differences among people or some performance outcome. Validity is always a statistical determination, never a subjective one. What is called face validity is not validity in the true sense of the word; it is really more akin to Facebook Likes and Dislikes. Be justifiably skeptical of any test whose claim to validity is something like "89% of those who received feedback said the results described them accurately," particularly when no supporting statistical data is provided. A test is not valid simply because people like what it says about them.
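To make "predicts in a statistically significant way" concrete, here is a minimal sketch in Python of the kind of check a real validation study formalizes: correlate assessment scores with a later performance outcome and look at both the correlation and its p-value. All the numbers are invented purely for illustration.

```python
# Minimal sketch of a predictive-validity check.
# All numbers are invented for illustration; real validation
# studies use much larger samples and proper study designs.
from scipy.stats import pearsonr

# Hypothetical assessment scores and later sales results (units per quarter).
test_scores = [62, 71, 55, 88, 90, 47, 76, 69, 83, 58]
performance = [14, 18, 11, 24, 22, 9, 17, 15, 21, 12]

r, p = pearsonr(test_scores, performance)
print(f"r = {r:.2f}, p = {p:.4f}")
# A validity claim rests on numbers like these, not on how many
# test-takers "liked" their feedback report.
```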
Validity and reliability are expressed as correlation coefficients, which essentially measure the extent to which two things move in unison. (A correlation by itself does not prove cause and effect.) For example, in the first two years of life, we would expect to see a high correlation between the weight and height of babies. Correlations express the strength of association: the degree to which one variable tracks another. So, if a vendor tries to explain validity in some other way, for example as an accuracy percentage, there is simply no scientific basis for that. It's baloney.
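Here is what that baby weight/height example looks like as an actual computation. The figures are made up for illustration; the point is that a correlation coefficient is an ordinary, checkable statistic, not a marketing number.

```python
# Minimal sketch: computing a Pearson correlation by hand.
# The infant weight/height figures are invented for illustration.
import statistics

def pearson_r(xs, ys):
    """How strongly two variables move in unison, from -1 to +1."""
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

weights_kg = [3.5, 5.1, 6.4, 7.8, 9.0, 10.1]  # hypothetical infant weights
heights_cm = [50, 57, 63, 68, 72, 75]         # matching heights

print(round(pearson_r(weights_kg, heights_cm), 3))  # ~0.996: near-perfect co-movement
```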
As noted above, in this era of big data, spin is becoming more prevalent, and you need to watch out for it. As an example, in measuring test reliability, the generally accepted cutoff for a trait scale is a .70 correlation. The higher the correlation, the greater the reliability, so .85 is much better. Tests have multiple scales, so if one falls slightly below .70, that does not nullify the value of the test or mean it shouldn't be used; it simply means that specific scale should be interpreted more cautiously. The spin is apparent today in several instruments whose scales fall well below the traditional cutoff. The reality is that those scales are weak and their value is questionable. One vendor in particular is using a white paper to rationalize many weak scales, claiming that new, more subjective measures of reliability make the .70 threshold less meaningful. That's obfuscation by complexity, deployed to defend something that may be indefensible. If you drill into a vendor's literature and see scales where r = .55 or thereabouts, understand that the scale is weak and a poor measure of whatever it's attempting to identify.
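For the reliability side, the .70 figure usually refers to an internal-consistency statistic, most commonly Cronbach's alpha. Here is a minimal sketch, with a made-up response matrix, showing that alpha is likewise a plain, reproducible calculation; a vendor hand-waving around a low value is not offering you a better statistic, just a fog machine.

```python
# Minimal sketch: Cronbach's alpha, the usual internal-consistency
# statistic behind the .70 reliability rule of thumb.
# The 1-5 responses below are invented for illustration.
import statistics

def cronbach_alpha(items):
    """items: one list per scale item, each holding one score per respondent."""
    k = len(items)
    item_var_sum = sum(statistics.variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's total
    return (k / (k - 1)) * (1 - item_var_sum / statistics.variance(totals))

# Hypothetical scale: 4 items answered by 6 respondents.
scale_items = [
    [4, 3, 5, 2, 4, 3],
    [4, 2, 5, 2, 3, 3],
    [5, 3, 4, 1, 4, 2],
    [3, 3, 5, 2, 4, 3],
]

print(round(cronbach_alpha(scale_items), 2))  # 0.92 here; well below .70 flags a weak scale
```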
Follow up?
There's much more to understanding test construction and validation than can be covered in the space of two blog posts, but as they say, it's a start. Please email me at fgump@2oms.com with questions, or comment below. You can also reach us on Twitter at @ADGIGroup or on Facebook.
__________________________________________________________________________

For more than forty years, Frank Gump has been helping corporations become more productive and profitable by helping management teams identify and hire top performers and manage them most effectively. Developed and refined through extensive experience in more than 1,200 organizations in the United States, Canada, England, and Australia, ADGI's Organizational Management System (OMS) is a finely calibrated, technologically advanced decision-making process offering the potential for enormous payback. Contact ADGI for more insight and connect with Frank on LinkedIn. Follow ADGI on Twitter @ADGIGroup. Like ADGI on Facebook and follow us on Google+.