Testing and Measurement
🟢 Lite — Quick Review (1h–1d)
Rapid summary for last-minute revision before your exam.
Testing and Measurement — Key Facts for NCE (Nigeria)
- Measurement: Assigning numbers to objects/events according to rules
- Assessment: Broader process including tests and non-test data
- Evaluation: Making judgments based on assessment data
- Test: Formal instrument measuring a sample of behavior
- ⚡ Exam tip: Validity ensures the test measures what it claims to measure; reliability ensures consistency of measurement
🟡 Standard — Regular Study (2d–2mo)
Standard content for students with a few days to months.
Testing and Measurement — NCE (Nigeria) Study Guide
Basic Concepts
Measurement: The process of assigning numbers to objects or events according to rules.
Assessment: Broad term including tests, observations, portfolios, etc.
Evaluation: Making judgments or decisions based on assessment information.
Test: A standardized instrument designed to measure a sample of behavior.
Scales of Measurement
1. Nominal Scale:
- Categorization only
- Numbers used as labels
- Example: Gender (Male=1, Female=2), Ethnicity
- Operations: Count, mode
2. Ordinal Scale:
- Rank order
- Differences not equal
- Example: Class position (1st, 2nd, 3rd)
- Operations: Median, percentile
3. Interval Scale:
- Equal intervals
- No absolute zero
- Example: Temperature in Celsius
- Operations: Mean, standard deviation
4. Ratio Scale:
- Equal intervals + absolute zero
- True ratios possible
- Example: Height, weight, age
- Operations: All statistical operations
Qualities of Good Tests
Validity: The test measures what it claims to measure.
Types of Validity:
- Content Validity: Test covers all aspects of content
- Criterion-Related Validity: Comparison with external criterion
- Concurrent: Correlates with criterion at same time
- Predictive: Predicts future performance
- Construct Validity: Measures theoretical construct
Reliability: The consistency of test results.
Types of Reliability:
- Test-Retest: Same test given twice
- Parallel Forms: Two equivalent versions
- Split-Half: Two halves of same test
- Inter-rater: Agreement between raters
Reliability vs. Validity:
- A test can be reliable without being valid
- A test cannot be valid without being reliable
NCE Exam Pattern
Common question types:
- Differences between measurement scales
- Types and characteristics of validity/reliability
- Computing measures of central tendency and dispersion
- Interpretation of test scores
- Construction of tests and rubrics
🔴 Extended — Deep Study (3mo+)
Comprehensive coverage for students on a longer study timeline.
Testing and Measurement — Comprehensive NCE (Nigeria) Notes
Detailed Theory
1. Nature of Educational Measurement
Definition: Educational measurement involves assigning numbers to student performance according to systematic rules.
Why Measure?
- Diagnose learning difficulties
- Evaluate instruction effectiveness
- Assign grades and credits
- Selection and placement
- Accountability
Limitations of Measurement:
- Cannot measure everything important
- Always some measurement error
- What gets measured may not be what matters most
- Social context affects measurement
2. Scales of Measurement — Detailed
NOMINAL SCALE:
- Purpose: Classification into distinct categories
- Characteristics: Mutually exclusive categories, no order implied
- Permissible Statistics: Mode, frequency counts, chi-square
- Examples:
- Types of schools (public, private, mission)
- States of Nigeria (36 + FCT)
- Pass/Fail
ORDINAL SCALE:
- Purpose: Rank ordering
- Characteristics: Categories have order, but intervals unequal/unknown
- Permissible Statistics: Median, percentile, rank correlation
- Examples:
- Class position (1st, 2nd, 3rd)
- Socioeconomic status (low, middle, high)
- Grade levels
INTERVAL SCALE:
- Purpose: Measure magnitude with equal intervals
- Characteristics: Zero point is arbitrary, no true ratio
- Permissible Statistics: Mean, standard deviation, correlation
- Examples:
- Temperature (Celsius/Fahrenheit)
- Standard scores (z-scores, T-scores)
- Dates on calendar
RATIO SCALE:
- Purpose: Measure with true zero and equal intervals
- Characteristics: Absolute zero, true ratios meaningful
- Permissible Statistics: All statistical operations
- Examples:
- Height
- Weight
- Age
- Number of correct answers
3. Validity — Comprehensive Treatment
Definition: The degree to which evidence and theory support the interpretations of test scores for intended purposes.
Evidence-Based Validity:
- Content evidence (test content)
- Response process evidence (how test-takers respond)
- Internal structure evidence (relationships within test)
- Relations to other variables (criterion evidence)
CONTENT VALIDITY:
- Degree to which test samples the content domain
- Subject matter expert judgment required
- Test blueprint/table of specifications
- Example: Math test covering only algebra when geometry also required = low content validity
CRITERION-RELATED VALIDITY:
- Concurrent Validity: Test correlates highly with a criterion measured at the same time
- Example: New IQ test correlates 0.85 with an established IQ test
- Predictive Validity: Test predicts a future criterion
- Example: JAMB scores predict university performance
- Validity coefficient indicates predictive power
CONSTRUCT VALIDITY:
- Degree to which test measures a theoretical construct
- Construct: A theoretical concept (intelligence, anxiety, motivation)
- Multiple forms of evidence gathered
- Example: Intelligence test validates against theories of intelligence
FACTORS AFFECTING VALIDITY:
- Test content unrepresentative
- Item ambiguity
- Test anxiety
- Guessing
- Administration errors
- Interpretation errors
4. Reliability — Comprehensive Treatment
Definition: The consistency of scores obtained by the same persons on different occasions, with different items, or under different conditions.
TRUE SCORE THEORY:
- Observed Score = True Score + Error Score
- X = T + E
- Perfect reliability = error variance of zero
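A minimal simulation (hypothetical numbers, Python standard library only) can illustrate X = T + E: independent error adds variance on top of the true scores, so zero error variance would make observed and true scores identical.

```python
import random
import statistics

random.seed(1)

# Hypothetical cohort: each observed score is a true score plus random error.
true_scores = [random.gauss(50, 10) for _ in range(5000)]   # T
errors      = [random.gauss(0, 4)  for _ in range(5000)]    # E, mean zero
observed    = [t + e for t, e in zip(true_scores, errors)]  # X = T + E

# With T and E independent, observed variance ~ true variance + error variance.
var_t = statistics.pvariance(true_scores)
var_e = statistics.pvariance(errors)
var_x = statistics.pvariance(observed)
print(round(var_x), round(var_t + var_e))  # approximately equal

# In classical test theory, reliability is the share of observed
# variance that is true-score variance.
reliability = var_t / var_x
print(round(reliability, 2))  # close to 100 / (100 + 16), about 0.86
```

The simulated reliability falls below 1.0 precisely because the error variance is nonzero, matching the "perfect reliability = error variance of zero" point above.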
TEST-RETEST RELIABILITY:
- Same test administered twice
- Time interval between tests
- Correlation between scores = reliability coefficient
- High correlation = high reliability
- Problem: Memory effects, practice effects
PARALLEL-FORMS (EQUIVALENT-FORMS) RELIABILITY:
- Two equivalent versions of test
- Both administered to same group
- Correlation between forms
- Minimizes memory effects
SPLIT-HALF RELIABILITY:
- One test, divided into two halves
- Odd-numbered vs. even-numbered items
- Correlation between halves
- Spearman-Brown prophecy formula adjusts for full test
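The split-half procedure above can be sketched in a few lines of Python. The score lists are hypothetical odd-item and even-item totals for eight students; the Spearman-Brown step-up formula 2r/(1 + r) estimates full-test reliability from the half-test correlation.

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_brown(half_r):
    """Step up a half-test correlation to estimate full-test reliability."""
    return 2 * half_r / (1 + half_r)

# Hypothetical data: each student's total on odd- vs. even-numbered items.
odd_half  = [10, 12, 9, 15, 11, 8, 14, 13]
even_half = [11, 13, 8, 14, 12, 9, 13, 14]

r_half = pearson_r(odd_half, even_half)
print(round(r_half, 2))                  # 0.91 (half-test correlation)
print(round(spearman_brown(r_half), 2))  # 0.95 (estimated full-test reliability)
```

Note that the stepped-up coefficient is always higher than the half-test correlation, reflecting the fact that longer tests tend to be more reliable.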
INTER-RATER RELIABILITY:
- Agreement between two or more raters
- Cohen’s Kappa for categorical judgments
- Pearson correlation for continuous scores
- ICC (Intraclass Correlation Coefficient)
RELIABILITY COEFFICIENTS:
- Range: 0 to 1.00
- 0.90+ = Excellent (high-stakes decisions)
- 0.80-0.89 = Good (classroom use)
- 0.70-0.79 = Adequate (group decisions)
- Below 0.70 = Questionable
RELIABILITY AND STANDARD ERROR OF MEASUREMENT:
SEM = SD × √(1 - r)
- SEM provides range within which true score likely falls
- Higher reliability → Smaller SEM
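A quick worked example of the SEM formula, using hypothetical values (SD = 10, r = 0.91):

```python
import math

def sem(sd, reliability):
    """Standard Error of Measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

s = sem(10, 0.91)
print(round(s, 1))  # 3.0

# A roughly 68% band for the true score: observed score +/- 1 SEM.
observed = 65
print(round(observed - s, 1), round(observed + s, 1))  # 62.0 68.0
```

Raising the reliability shrinks the band: at r = 0.75 the same SD gives SEM = 5, so the true-score band around 65 widens to 60-70.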
5. Measures of Central Tendency
MEAN:
- Arithmetic average
- Most sensitive to extreme scores
- Best for interval/ratio data
- Formula: Σx/n
MEDIAN:
- Middle value when arranged in order
- Less affected by extreme scores
- Better for ordinal or skewed distributions
- Position = (n+1)/2
MODE:
- Most frequently occurring value
- Used with nominal data
- May have no mode or multiple modes
When to Use Each:
| Data Type | Best Measure | Reason |
|---|---|---|
| Nominal | Mode | Only appropriate |
| Ordinal | Median | Rank order |
| Interval/Ratio (symmetric) | Mean | Most sensitive |
| Interval/Ratio (skewed) | Median | Resistant to outliers |
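The three measures can be compared directly with Python's `statistics` module on a hypothetical, positively skewed set of class scores (one extreme score of 95):

```python
import statistics

# Hypothetical class scores; the single 95 skews the distribution upward.
scores = [45, 50, 52, 52, 55, 58, 60, 95]

print(statistics.mean(scores))    # 58.375 - pulled upward by the outlier
print(statistics.median(scores))  # 53.5   - middle of the ordered scores
print(statistics.mode(scores))    # 52     - most frequent score
```

The gap between mean (58.375) and median (53.5) is exactly the outlier sensitivity the table describes, which is why the median is preferred for skewed distributions.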
6. Measures of Dispersion
RANGE:
- Maximum - Minimum
- Simplest measure
- Affected by outliers
VARIANCE:
- Average of squared deviations from mean
- Population variance: Σ(x-μ)²/N
- Sample variance: Σ(x-x̄)²/(n-1)
STANDARD DEVIATION:
- Square root of variance
- In same units as original data
- Most commonly used measure
- Formula: σ = √[Σ(x-μ)²/N]
COEFFICIENT OF VARIATION:
- CV = (SD/Mean) × 100
- Allows comparison across different scales
- Useful for comparing variability of different distributions
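All four dispersion measures, computed for a small hypothetical population of five scores (population formulas, N in the denominator):

```python
import statistics

scores = [40, 45, 50, 55, 60]  # hypothetical population of five scores

rng  = max(scores) - min(scores)     # Range = Maximum - Minimum
var  = statistics.pvariance(scores)  # population variance: sum((x-mu)^2)/N
sd   = statistics.pstdev(scores)     # standard deviation = sqrt(variance)
mean = statistics.mean(scores)
cv   = sd / mean * 100               # coefficient of variation, in percent

print(rng, var, round(sd, 2), round(cv, 1))
```

For sample data, `statistics.variance` and `statistics.stdev` apply the n-1 denominator shown in the sample-variance formula above.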
7. Normal Distribution and Standard Scores
Normal Distribution:
- Bell-shaped, symmetric
- Mean = Median = Mode
- Defined by mean and standard deviation
- 68% within 1 SD, 95% within 2 SD, 99.7% within 3 SD
Z-SCORES:
- Standard score showing position in SD units
- z = (X - μ)/σ
- Mean of z-scores = 0
- SD of z-scores = 1
T-SCORES:
- z-score transformed to have mean of 50 and SD of 10
- T = 50 + 10(z)
PERCENTILE RANKS:
- Percentage of scores below given score
- 60th percentile = scored higher than 60% of test-takers
- Not equal intervals — difference between percentiles varies
8. Types of Tests
Standardized Tests:
- Norm-referenced or criterion-referenced
- Administered under uniform conditions
- Content and scoring standardized
- Examples: WAEC, NECO, JAMB
Teacher-Made Tests:
- Designed for specific classroom
- Based on specific instruction
- More flexible format
- Diagnostic purposes
CRITERION-REFERENCED vs. NORM-REFERENCED:
| Aspect | Criterion-Referenced | Norm-Referenced |
|---|---|---|
| Purpose | Mastery of objectives | Relative standing |
| Comparison | To standard | To other test-takers |
| Interpretation | % who mastered | Percentile rank |
| Example | Driving test (pass/fail) | IQ test (percentile) |
9. Test Construction
STEPS IN TEST CONSTRUCTION:
- Define objectives/content to be tested
- Prepare table of specifications
- Select item types
- Write items
- Review and edit items
- Produce final test
- Administer
- Analyze items
- Revise as needed
TABLE OF SPECIFICATIONS (Test Blueprint):
- Grid showing content areas vs. cognitive levels
- Ensures representative sampling
- Guides item writing
- Documents content validity
ITEM WRITING PRINCIPLES:
- Clear, unambiguous language
- One main idea per item
- Avoid clues (grammatical cues, word frequency)
- Appropriate difficulty
- Free from bias
- Exactly one clearly correct option per item
10. Item Analysis
DIFFICULTY INDEX:
- P = Number correct / Total number
- Range 0 to 1
- 0.30-0.70 ideal for most purposes
- Too easy (P>0.90) or too hard (P<0.20) = poor discrimination
DISCRIMINATION INDEX:
- Difference between upper and lower groups
- D = (% in upper group correct) - (% in lower group correct)
- Range -1 to +1
- 0.40+ = Good discrimination
- Negative = Item may be keyed incorrectly
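Both indices reduce to simple proportions. A sketch with hypothetical item data (30 of 40 students correct overall; 9 of the top 10 scorers and 4 of the bottom 10 scorers correct):

```python
def difficulty_index(num_correct, total):
    """P = number correct / total number of test-takers."""
    return num_correct / total

def discrimination_index(upper_correct, lower_correct, group_size):
    """D = proportion correct in upper group minus proportion in lower group."""
    return upper_correct / group_size - lower_correct / group_size

# Hypothetical item: 30 of 40 students answered correctly;
# 9 of the top 10 scorers got it right, versus 4 of the bottom 10.
p = difficulty_index(30, 40)
d = discrimination_index(9, 4, 10)
print(p)  # 0.75 - slightly easy but within the 0.30-0.70..0.90 usable band
print(d)  # 0.5  - good discrimination (at or above 0.40)
```

A negative D would mean the weaker students outperformed the stronger ones on the item, the classic sign of a mis-keyed answer.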
Practice Questions for NCE
- Differentiate between validity and reliability, explaining why a test can be reliable without being valid.
- A test has a mean of 50 and standard deviation of 10. Calculate the z-score for a student scoring 70.
- Explain the differences between norm-referenced and criterion-referenced tests.
- What is the Standard Error of Measurement and how does it affect interpretation of test scores?
- Describe the steps involved in constructing a classroom test.