
Overview

The data quality module is where users configure the data quality checks they want to run on incoming data from their SurveyCTO form. These checks feed into a data quality monitoring dashboard with different metrics such as the percentage of data quality violations aggregated by enumerator and location. To enable the data quality module, select “Data Quality Dashboard” under Feature Selection.

Prerequisites

This step has the following prerequisites:
  1. Configure the main form on SurveyStream: Complete the SurveyCTO Integration step on SurveyStream and load the main form definition since all data quality checks are linked to a main form.
  2. (Optional) Deploy required data quality forms on SurveyCTO: Before configuring your data quality checks on SurveyStream make sure your data quality forms are deployed on SurveyCTO. This step is required only if you want to include checks based on data quality forms.
    Please check out the data quality form requirements section below for things to keep in mind while coding these forms.
  3. Configure Survey Status for Targets: SurveyStream provides an option to filter data on the survey status variable before applying checks, for example to run checks only on completed and partially completed submissions. Ensure that the Survey Status for Targets module is complete and includes every survey status value on which you want to run the checks.
  4. Decide which checks to run and associated inputs: It will help if you have discussed and decided which checks to run and the inputs needed for each check before starting the configuration.

Configuration

Key concepts

  Data quality forms

Survey teams often create separate SurveyCTO forms for data quality processes that are carried out by a monitor. SurveyStream can use these data quality forms to calculate metrics like mismatch, protocol violation and spotcheck scores. SurveyStream supports the following types of data quality forms:
  1. Spotcheck: A monitor accompanies a surveyor to ensure they are following protocols
  2. Backcheck: A monitor calls back or revisits a respondent in person to re-ask a few of the survey questions and verify the responses
  3. Audio Audit: A monitor listens to a recording of the survey to ensure that the surveyor didn’t make entry errors and followed protocols correctly

  Data quality checks

SurveyStream supports ten different types of data quality checks:
  1. Logic: Check that certain skip patterns and logical relationships among variables are followed.
  2. Constraint: Check that the variable values fall within provided minimum and maximum constraints. Soft and hard constraint checks are available for finer grained monitoring.
  3. Outlier: Check whether continuous variables contain outliers, where an outlier is a value that lies beyond a given multiple of the interquartile range or standard deviation, or beyond a given percentile.
  4. Missing: Check if certain variables have a high percentage of missing values.
  5. Don’t Know: Check if certain variables have a high percentage of don’t know values.
  6. Refusal: Check if certain variables have a high percentage of refusal values.
  7. Mismatch: Check that a variable value in the main form matches the value of the same variable recorded in a data quality form.
  8. Protocol Violation: Check if a protocol has been violated as per entries in a data quality form.
  9. Spotcheck Score: Average the spotcheck scores recorded in data quality forms.
  10. GPS: Verify that the GPS location of the household is within the sampled grid boundary, or that it matches the GPS location of the sampled household within a margin of error.

Process

Configuring data quality forms

Adding data quality forms is very similar to adding a main form:
1. Form details

The first step is to provide the form details, which include:

| Input | Description |
| --- | --- |
| Main SCTO form | Form ID for the main SurveyCTO form linked to the data quality form |
| DQ form type | Type of data quality form: audio audit, spotcheck or backcheck |
| DQ form ID | Form ID of the data quality SurveyCTO form. This must match the form ID on the SurveyCTO form definition. |
| DQ form name | Form name of the data quality SurveyCTO form |
2. SurveyCTO questions

The second step is to map variables in the SurveyCTO form to the following required metadata fields:
  1. Target ID - Unique identifier for the survey respondent
  2. Enumerator ID - Unique identifier for the enumerator
  3. DQ enumerator ID - Unique identifier for the monitor who is filling out the data quality form
  4. Location variables (dynamic) - Unique identifier for each of the location levels configured in the survey
You can add multiple data quality forms for each main form and also edit/delete them if required.

Configuring data quality checks

There are two primary tasks to complete for this step:
1. Global configuration

This step has the following inputs:
  1. Select survey status values: Checks run on the SurveyCTO submissions with survey status values selected in this step. The dropdown has the list of all possible survey status values configured in the Survey Status for Targets module. This option is generally used to run checks only on fully completed submissions.
  2. Group by module name: When this option is selected, all checks will have a ‘Module name’ input and the metrics on the data quality monitoring dashboard can be grouped by module.
2. Configure checks

Here, you can provide the inputs for each check type. The inputs vary based on type of check. Below is a short description of inputs per check type:
These inputs are common across all checks:
| Input | Description |
| --- | --- |
| Select variable | The variable that will be flagged if the check is violated. |
| Flag description | (Optional) A short description of the flag that can be added on the dashboard for more context. |
| Filter group | (Optional) Conditions for filtering the data before applying the check. The filter groups are joined by an OR operator and conditions within a group are joined by an AND operator. |
| Module Name | (Optional) This is enabled when Group by module name is selected under global configuration and the value entered here is used to group results in the dashboard. |
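The OR/AND rule for filter groups can be sketched as follows. This is only an illustration of the logic described above, not SurveyStream's implementation: the function name `passes_filter`, the representation of conditions as Python callables, the sample variables, and the treatment of an empty filter as "keep everything" are all assumptions.

```python
from typing import Any, Callable

Row = dict[str, Any]
Condition = Callable[[Row], bool]


def passes_filter(row: Row, filter_groups: list[list[Condition]]) -> bool:
    """Return True if the row satisfies at least one filter group.

    Conditions inside a group are joined by AND; groups are joined by OR.
    Assumption: no filter groups means the row is kept.
    """
    if not filter_groups:
        return True
    return any(all(cond(row) for cond in group) for group in filter_groups)


# Hypothetical filter: (age > 30 AND land == 1) OR (region == "north")
groups = [
    [lambda r: r["age"] > 30, lambda r: r["land"] == 1],
    [lambda r: r["region"] == "north"],
]

kept_by_group_1 = passes_filter({"age": 40, "land": 1, "region": "south"}, groups)
kept_by_group_2 = passes_filter({"age": 20, "land": 0, "region": "north"}, groups)
dropped = passes_filter({"age": 40, "land": 0, "region": "south"}, groups)
```

Only rows for which the function returns True would go on to be checked.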
Additional inputs for logic checks:

| Input | Description |
| --- | --- |
| Other variables | (Optional) Additional variables needed for the logic check’s assert conditions. These variables are assigned the aliases B, C and so on. (The main variable is given the alias A.) |
| Assertions | Assert conditions like A == B, where A and B are aliases for the selected variables. The operators allowed in a condition are: +, -, *, /, **, >, >=, <, <=, ==, !=. Assertion groups are joined by an OR operator. Assertions within a group are joined by an AND operator. |
Kindly note that assert conditions are conditions that must evaluate to True for the check to pass. If a condition evaluates to False, the submission is flagged. E.g., the logic check for “flag if income is less than or equal to 0 when age is greater than 30 and the respondent has said they have agricultural land” can be framed as:
Assert `income` > 0 if `age` > 30 AND `land` == 1
Here, the assert condition is income > 0 and the filters are age > 30 and land == 1.
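The income example can be sketched in a few lines. This is an illustrative reading of the rule "filter first, then flag where the assertion is False", not SurveyStream's code; `logic_check_flags`, the `id` field and the sample submissions are hypothetical.

```python
from typing import Any, Callable

Row = dict[str, Any]


def logic_check_flags(
    rows: list[Row],
    applies: Callable[[Row], bool],
    assertion: Callable[[Row], bool],
) -> list[Row]:
    """Return the rows that match the filter but fail the assertion."""
    return [row for row in rows if applies(row) and not assertion(row)]


# Hypothetical submissions for the income example.
submissions = [
    {"id": "T1", "income": 0, "age": 45, "land": 1},    # filter matches, assertion fails
    {"id": "T2", "income": 500, "age": 45, "land": 1},  # filter matches, assertion holds
    {"id": "T3", "income": 0, "age": 25, "land": 1},    # excluded by the filter
]

flagged = logic_check_flags(
    submissions,
    applies=lambda r: r["age"] > 30 and r["land"] == 1,  # filters
    assertion=lambda r: r["income"] > 0,                 # assert condition (A > 0)
)
flagged_ids = [row["id"] for row in flagged]
```

Note that T3 is not flagged even though its income is 0, because the filter removes it before the assertion is evaluated.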
Additional inputs for constraint checks:

| Input | Description |
| --- | --- |
| Hard Min / Max | Strict minimum / maximum values allowed for a variable |
| Soft Min / Max | Preferred minimum / maximum values for a variable |

Each of these four fields is optional on its own, but at least one of them must be non-empty.
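A minimal sketch of how soft and hard bounds might interact is shown below. The precedence (a hard violation is reported instead of a soft one), the inclusive bounds, and the function name `constraint_flag` are assumptions made for illustration; the source only states that the four bounds exist and that at least one must be set.

```python
from typing import Optional


def constraint_flag(
    value: float,
    hard_min: Optional[float] = None,
    hard_max: Optional[float] = None,
    soft_min: Optional[float] = None,
    soft_max: Optional[float] = None,
) -> Optional[str]:
    """Return "hard", "soft" or None for a single value.

    Assumptions: bounds are inclusive and a hard violation takes precedence.
    """
    if all(b is None for b in (hard_min, hard_max, soft_min, soft_max)):
        raise ValueError("at least one of the four bounds must be provided")
    if (hard_min is not None and value < hard_min) or (
        hard_max is not None and value > hard_max
    ):
        return "hard"
    if (soft_min is not None and value < soft_min) or (
        soft_max is not None and value > soft_max
    ):
        return "soft"
    return None


# Hypothetical bounds for an age variable.
bounds = {"hard_min": 0, "hard_max": 120, "soft_min": 18, "soft_max": 90}
results = [constraint_flag(v, **bounds) for v in (130, 95, 40)]
```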
Additional inputs for outlier checks:

| Input | Description |
| --- | --- |
| Measure | The metric to be used for outlier calculation: Inter Quartile Range, Standard Deviation or Percentile |
| Multiplier / Value | The multiple of the interquartile range or standard deviation (like 1.96 times the standard deviation) or the percentile value (such as ± 5th percentile) that signifies an outlier. |
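The three measures can be illustrated with Python's standard `statistics` module. The exact formulas SurveyStream uses are not stated in this document, so the following are assumptions: the interquartile rule flags values outside [Q1 - k·IQR, Q3 + k·IQR], the standard deviation rule flags values outside mean ± k·SD (sample SD), and the percentile rule with value p flags values below the p-th or above the (100 - p)-th percentile. The function name `find_outliers` and the sample data are hypothetical.

```python
import statistics


def find_outliers(values: list[float], measure: str, k: float) -> list[float]:
    """Return the values treated as outliers under the chosen measure."""
    if measure == "iqr":
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
    elif measure == "sd":
        mean = statistics.mean(values)
        sd = statistics.stdev(values)  # sample standard deviation (assumption)
        low, high = mean - k * sd, mean + k * sd
    elif measure == "percentile":
        p = int(k)  # assumed to be a whole percentile between 1 and 49
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        low, high = cuts[p - 1], cuts[100 - p - 1]
    else:
        raise ValueError(f"unknown measure: {measure}")
    return [v for v in values if v < low or v > high]


# Hypothetical data with one extreme value.
data = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
iqr_outliers = find_outliers(data, "iqr", 1.5)
sd_outliers = find_outliers(data, "sd", 1.96)
pct_outliers = find_outliers(data, "percentile", 5)
```

Under these assumptions a percentile rule always flags the tails of the distribution, while the IQR and SD rules flag nothing when the data has no extreme values.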
Additional inputs for missing, don’t know and refusal checks:

| Input | Description |
| --- | --- |
| Value | The value or list of values that corresponds to missing/don’t know/refusal as per the form definition. |

There are two modes for running these checks: Apply check on all variables in the form and Apply check on select variables. If Apply check on all variables in the form is selected, SurveyStream checks the form definition to find all questions whose choice list allows the specified value and runs the check on those variables. For missing value checks, if the value is one of (empty), NULL, NA or NAN, the check is run on all variables that are not mandatory (required != 'yes').
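The "all variables" mode can be pictured as a lookup over choice lists followed by a percentage calculation. The sketch below assumes a simplified form definition (a mapping from variable name to its allowed choice values); the function names, the variable names and the `-999` don't know code are hypothetical.

```python
from typing import Any


def variables_allowing_value(
    choice_lists: dict[str, list[str]], value: str
) -> list[str]:
    """Return the variables whose choice list includes the given value."""
    return [var for var, choices in choice_lists.items() if value in choices]


def percent_with_value(
    rows: list[dict[str, Any]], variable: str, values: set[str]
) -> float:
    """Percentage of submissions where the variable equals one of the values."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if str(row.get(variable, "")) in values)
    return 100 * hits / len(rows)


# Simplified, hypothetical form definition: variable -> allowed choice values.
choice_lists = {
    "consent": ["0", "1"],
    "income_source": ["1", "2", "-999"],
    "owns_land": ["0", "1", "-999"],
}
dk_variables = variables_allowing_value(choice_lists, "-999")

rows = [
    {"income_source": "1"},
    {"income_source": "-999"},
    {"income_source": "2"},
    {"income_source": "1"},
]
dk_percent = percent_with_value(rows, "income_source", {"-999"})
```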
Additional inputs for mismatch checks:

| Input | Description |
| --- | --- |
| Data quality form | The data quality form containing the variable to check against |

Kindly note that for this check the variable name must be the same in both forms.
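A mismatch check amounts to pairing each data quality submission with the main form submission for the same target and comparing the shared variable. The sketch below assumes the pairing is done on Target ID and that unmatched data quality submissions are skipped; `mismatch_rate`, the `target_id` key and `hh_size` are hypothetical names.

```python
from typing import Any

Row = dict[str, Any]


def mismatch_rate(
    main_rows: list[Row],
    dq_rows: list[Row],
    variable: str,
    target_key: str = "target_id",
) -> float:
    """Percentage of matched targets whose values differ between the forms."""
    main_by_target = {row[target_key]: row for row in main_rows}
    compared = mismatched = 0
    for dq in dq_rows:
        main = main_by_target.get(dq[target_key])
        if main is None:
            continue  # no main form submission for this target (assumption: skip)
        compared += 1
        if main[variable] != dq[variable]:
            mismatched += 1
    return 100 * mismatched / compared if compared else 0.0


main_form = [
    {"target_id": "T1", "hh_size": 4},
    {"target_id": "T2", "hh_size": 5},
]
backcheck_form = [
    {"target_id": "T1", "hh_size": 4},  # matches
    {"target_id": "T2", "hh_size": 6},  # mismatch
    {"target_id": "T9", "hh_size": 3},  # no main submission, skipped
]
rate = mismatch_rate(main_form, backcheck_form, "hh_size")
```

Because the lookup uses the same `variable` key on both sides, the sketch also shows why the variable name has to be identical in the two forms.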
Additional inputs for protocol violation checks:

| Input | Description |
| --- | --- |
| Data quality form | The data quality form containing the protocol question |

Kindly note that for all protocol questions, calculations assume that the value 0 indicates a violation, while the value 1 indicates no violation.
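The 0/1 coding convention can be made concrete with a short calculation. This is only an illustration of the convention stated above; the function name and the sample responses are hypothetical.

```python
def protocol_violation_rate(responses: list[int]) -> float:
    """Percentage of responses coded 0 (violation); 1 means no violation."""
    if not responses:
        return 0.0
    violations = sum(1 for r in responses if r == 0)
    return 100 * violations / len(responses)


# Hypothetical answers to one protocol question across four spotchecks.
rate = protocol_violation_rate([1, 0, 1, 1])
```

If a form codes the question the other way round (1 for a violation), the computed rate would be inverted, which is why the coding matters.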
Additional inputs for spotcheck score checks:

| Input | Description |
| --- | --- |
| Data quality form | The data quality form containing the spotcheck score question |
| Score Name | (Optional) Scores from multiple questions can be combined and aggregated against this score name. If not provided, the question name is taken as the score name by default. |
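The effect of the optional score name can be sketched as a group-and-average step. The averaging itself follows the description of the Spotcheck Score check; the data layout, the function name `average_scores` and the question names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def average_scores(
    records: list[tuple[str, float]], score_names: dict[str, str]
) -> dict[str, float]:
    """Average scores per score name.

    Questions without a configured score name fall back to the question name.
    """
    grouped: dict[str, list[float]] = defaultdict(list)
    for question, score in records:
        grouped[score_names.get(question, question)].append(score)
    return {name: mean(scores) for name, scores in grouped.items()}


# Hypothetical spotcheck records: (question name, score).
records = [
    ("greeting_score", 4.0),
    ("consent_score", 2.0),
    ("probing_score", 5.0),
]
# Two questions share the score name "protocol"; the third has none.
score_names = {"greeting_score": "protocol", "consent_score": "protocol"}
averages = average_scores(records, score_names)
```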
Additional inputs for GPS checks:

| Input | Description |
| --- | --- |
| Type | The type of check: Point to Shape or Point to Point. A Point to Shape check verifies that the GPS location of the household surveyed is within the expected grid cell or shape boundary. A Point to Point check verifies that the household surveyed is the correct sampled household as per its GPS coordinates. |
| Grid ID Variable | SurveyCTO question for the grid ID. This is a mandatory input for the Point to Shape check type. |
| Expected GPS Variable | SurveyCTO question for the expected GPS coordinates of the household surveyed, based on a listing/sampling exercise. The GPS coordinates are expected to be in the format: “latitude longitude”. This is a mandatory input for the Point to Point check type. |
| Threshold distance (m) | The value of ‘X’ for checking whether the GPS location of the surveyed household is within ‘X’ meters of a grid cell boundary or of the sampled household’s GPS coordinates |

For Point to Shape checks, the team also has to share the shape files for the grids with the SurveyStream team. The file names of these shape files must follow the format: <grid id>.gpkg.
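A Point to Point check reduces to a distance comparison between two coordinate pairs. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6,371 km; the document does not say which distance formula SurveyStream uses, so that choice, the function names and the sample coordinates are assumptions. Any tokens after latitude and longitude are ignored when parsing.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def parse_gps(text: str) -> tuple[float, float]:
    """Parse a "latitude longitude" string; extra tokens are ignored."""
    lat, lon = text.split()[:2]
    return float(lat), float(lon)


def haversine_m(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def within_threshold(actual: str, expected: str, threshold_m: float) -> bool:
    """True if the surveyed location is within threshold_m of the expected one."""
    return haversine_m(parse_gps(actual), parse_gps(expected)) <= threshold_m


# Hypothetical coordinates about 111 m apart (0.001 degrees of latitude).
surveyed = "12.9716 77.5946"
expected = "12.9726 77.5946"
passes_at_150m = within_threshold(surveyed, expected, 150)
passes_at_100m = within_threshold(surveyed, expected, 100)
```

A Point to Shape check would instead need the grid geometry from the shape file and a point-in-polygon or distance-to-boundary test, which is not sketched here.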

Walkthrough

[Add a configuration walkthrough video]

Adding/editing checks during the survey

You can add/edit checks during the survey following the same process as configuring checks for the first time. The changes take roughly 30 minutes to 1 hour to reflect on the dashboard. Changes apply to all submissions of the form: all flags corresponding to a deleted or inactive check are removed, and newly added checks run on all submissions, including those received before the change.

Handling inactive checks

During the survey, SurveyStream refreshes the form definition from SurveyCTO every 30 minutes. If a variable is removed from the form definition, any check using that variable will automatically be marked as inactive and Survey Admins will receive a warning notification regarding this change. When inactive, the check is not run and the corresponding flags are removed. You can edit such inactive checks to replace the removed variables and then mark them as active again.
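The deactivation rule can be sketched as a comparison between each check's variables and the refreshed form definition. The data structures and the function name `refresh_check_status` are hypothetical; the sketch only mirrors the behaviour described above, including the fact that a check does not become active again on its own.

```python
from typing import Any

Check = dict[str, Any]


def refresh_check_status(checks: list[Check], form_variables: set[str]) -> list[Check]:
    """Mark checks inactive when any of their variables left the form definition.

    A check that is already inactive stays inactive; reactivation is manual.
    """
    updated = []
    for check in checks:
        missing = [v for v in check["variables"] if v not in form_variables]
        updated.append(
            {**check, "active": check["active"] and not missing, "missing": missing}
        )
    return updated


# Hypothetical checks and a refreshed form definition without "land".
checks = [
    {"name": "income_logic", "variables": ["income", "age", "land"], "active": True},
    {"name": "age_constraint", "variables": ["age"], "active": True},
]
refreshed = refresh_check_status(checks, {"income", "age"})
inactive_names = [c["name"] for c in refreshed if not c["active"]]
```

The `missing` list corresponds to the variables a Survey Admin would need to replace before marking the check as active again.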

Additional notes

Data quality form requirements

  1. Ensure each data quality form has the following variables:
    1. Target ID - Unique identifier for the survey respondent
    2. Enumerator ID - Unique identifier for the enumerator
    3. DQ enumerator ID - Unique identifier for the monitor who is filling out the data quality form
    4. Location variables (dynamic) - Unique identifier for each of the geo levels configured in the survey
  2. For mismatch checks, ensure the variable name on the data quality form matches the variable name on the main form
  3. For protocol violation checks, ensure that 0 indicates a violation on all protocol related questions