Picking a Test that Works and Suits Your Needs, Part 2

Furthering the discussion regarding test validity and reliability, here is a final tip to assist you with your analysis.

Avoid getting hoodwinked

As a reminder, a test should be able to predict in a statistically significant way performance differences among people or some performance outcome. Validity is always a statistical determination and never a subjective one. What is called face validity is not validity in the true sense of the word, but is really more akin to Facebook Likes and Dislikes. You should be justifiably cautious of any test that makes a claim, such as “89% of those who received feedback said the results described them accurately,” particularly if no specific statistical data is also provided. A test is not valid simply because people like what it says about them.

Validity and reliability are expressed as correlation coefficients, which essentially measure the extent to which two things move in unison. (Keep in mind that a correlation by itself shows association, not proof of cause and effect.) For example, in the first two years of life, we would expect to see a high correlation between the weight and height of babies. Correlations express likelihood: the extent to which one variable is likely related to another. So, if a vendor tries to explain validity in some other way, for example as an accuracy percentage, there is simply no scientific basis for that. It’s baloney.
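For readers who like to see the math, here is a rough sketch of how a correlation coefficient is actually computed. The numbers below are made up purely for illustration (loosely mimicking the baby weight/height example above); they are not data from any actual test or study.

```python
# Pearson correlation coefficient: how strongly two variables move together.
# Values range from -1 (perfect inverse relationship) to +1 (perfect direct).
def pearson_r(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical baby weights (kg) and heights (cm) -- illustrative only
weights = [3.5, 5.0, 6.4, 7.8, 9.1, 10.2]
heights = [50, 58, 64, 70, 74, 78]
print(f"r = {pearson_r(weights, heights):.2f}")
```

With numbers like these, r comes out very close to +1, which is exactly what “a high correlation between the weight and height of babies” means in statistical terms.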

As noted above, in this era of big data, spin is becoming more prevalent, and you need to watch out for it. As an example, in measuring test reliability, the generally accepted cutoff for a trait scale would be a .70 correlation. The higher the correlation, the greater the reliability, so .85 is a lot better. Tests have multiple scales, so if one falls slightly below .70, that does not nullify the value of the test or mean that it shouldn’t be used. It simply means that specific scale should be treated more cautiously. The spin angle is apparent today with several instruments that have numerous scales that fall well below the traditional cutoff. The reality is that the scales are weak and their value is questionable. One vendor in particular is using a white paper to rationalize many weak scales by claiming new and more subjective measures of reliability make the .70 threshold less meaningful. That’s obfuscation by complexity just to defend something that may be indefensible. If you drill into their literature and see scales where r=.55 or something similar, understand that the scale is weak and a poor measure of whatever it’s attempting to identify.

Follow up?

There’s much more to understanding all the considerations of test construction and validation than what can be covered in the space of two blogs, but as they say, this is a start. Please email me at fgump@2oms.com with questions, or comment below. You can also reach us on Twitter at @ADGIGroup or on Facebook.

__________________________________________________________________________

For more than forty years, Frank Gump has been helping corporations become more productive and profitable by helping management teams identify and hire top performers and manage them most effectively. Developed and refined through extensive experience in more than 1200 organizations in the United States, Canada, England, and Australia, ADGI’s Organizational Management System (OMS) is a finely calibrated, technologically advanced decision-making process offering the potential for enormous payback. Contact ADGI for more insight and connect with Frank on LinkedIn. Follow ADGI on Twitter @ADGIGroup. Like ADGI on Facebook and follow us on Google+.

Picking a Test that Works and Suits Your Needs, Part 1

Trying to navigate through test validity and reliability is a jungle! After reviewing a myriad of validation claims over the years, you begin to realize that truth is sometimes hard to find.

Here are a couple of tips to help you with any investigation you might want to do. Remember, the goal of any test is to add situationally relevant insight, so if it doesn’t do that, you need to move on to something that will.

“Frankly, I’m shocked!”

Whereas claims from some test vendors are straightforward, others are disingenuous. Some vendors simply make claims with no supporting documentation, others publish weighty tomes with irrelevant content in the belief that people will associate truth with weight and technical complexity, and still others try to support their claims with nonsensical information. And now there’s a new twist: Some vendors are trying to reframe accepted measures of validity or reliability to make their instruments look better than they really are. Lipstick on a pig? Sure sounds like it…

Pick the right tool for what you want to do.

If you are going to use a behavioral assessment, you first need to make sure that you are selecting the right type of instrument for your needs. There are two types of tests to choose from: a normative design and an ipsative design. A normative test is intended for decision making, because it compares individuals to a work group or a defined population and allows individuals to be compared to one another. In other words, when you’re trying to determine whether or not a new hire fits your company culture, this is the best option. In contrast, ipsative instruments are most appropriate for personal discovery or group-understanding applications where people are not compared with one another and decisions are unnecessary. Such tests are based upon self-referent measures of relative behaviors and strengths and don’t offer a meaningful basis for comparing people. Ipsative tests are primarily intended for coaches and trainers who are trying to identify the talents of their clients and teams.

Although some vendors of ipsative instruments point out the purposes and limitations of their test design, others don’t. Here’s where the spin comes in: At least one vendor goes so far as to claim that, because they have more than 10 scales, their results approximate those of a normative test, which raises the question: Why not just use a normative test rather than a wannabe?

The bottom line is: Don’t get blinded by brand or fooled by spin. Find out which tools are appropriate for your applications and information needs.

Understand what to look for.

Validity and reliability in a business decision-making context are really very simple:

A test or instrument should measure what it claims to measure, which is called construct validity. For example, if a test measures social initiative and friendliness, does it accurately distinguish between those who are more sociable and those who are not?

A test should show evidence that the scales have internal consistency and that repeated test results are consistent. This is reliability. If that test supposedly measuring social initiative shows different results over several administrations, then it’s really not measuring anything.

Finally, a test should be able to predict in a statistically significant way some performance outcome. This is criterion or predictive validity. If you are using a test to make placement decisions, then more accurately predicting performance or some dimension of performance is the goal.
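To make “predict in a statistically significant way” concrete, here is a rough sketch of a predictive-validity check: correlate test scores with a later performance measure, then test whether that correlation could plausibly be zero. The eight test scores and performance ratings below are invented for illustration; a real validation study would use far larger samples.

```python
# Predictive validity sketch: correlate hiring-test scores with a later
# performance outcome, then compute a t-statistic to gauge significance.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical data: 8 hires' test scores and later performance ratings
scores = [62, 71, 55, 80, 68, 90, 58, 75]
ratings = [3.1, 3.6, 2.8, 4.2, 3.4, 4.5, 3.0, 3.9]

n = len(scores)
r = pearson_r(scores, ratings)
# t-statistic for the null hypothesis that the true correlation is zero;
# compare it to the critical t value for n-2 degrees of freedom.
t = r * ((n - 2) / (1 - r * r)) ** 0.5
print(f"r = {r:.2f}, t = {t:.2f}")
```

If the resulting t-statistic exceeds the critical value for the sample size, the test is predicting the outcome at a statistically significant level; that, not an “accuracy percentage,” is what a legitimate validity claim looks like.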

In the next post, you can learn how to avoid being fooled by statistical spin.