The analysis of large data sets requires an ability to convert data into meaningful business information. Only a few years ago twenty gigabytes of data was considered very large, but with advances in storage technology it is not unusual for companies today to deal with terabytes of data.
The challenge for corporations is to make data quality requirements and data quality assessment an integral part of every project and also make data quality assurance a part of every application’s continuing data management practice.
Data quality has multiple dimensions:
- Accuracy
- Timeliness
- Relevance
- Completeness
- Understood by users
- Trusted by users
When analyzing large datasets, several practices help maintain data quality:

- First, create associations between multiple data sources, which can be complex. For instance, if a marketer receives an impression-and-click report from one vendor and a source-related report from another vendor as email attachments, it is reasonable to set up automated processes. Email2DB software is one possibility: it retrieves an email attachment received from a specific email address, places it into a specified folder, and imports it into an Access or SQL Server database (Email2DB software → folder → Access/SQL database). This streamlines the process and reduces manual intervention.
- Second, process each new piece of information before placing it into the master database; re-processing and subsequent data indexing can delay the workflow.
- Third, continuously monitor landing pages, the checkout process, unsubscribe forms, and other critical tagged pages to ensure the pixels always fire, which prevents long periods of data loss.
- Fourth, conduct regular audits (monthly or quarterly). Every page across the entire site should be audited quarterly to confirm analytics are working correctly, to identify any problems, and to document complete implementation for reporting purposes.
- Fifth, add verification and re-verification processes to reporting and analysis. Identifying and tracing where an error originates should become an essential step in validating data:
1. Initial data entry. Errors can result from typos made by the data entry person, insufficient training of entry staff, poor data entry form design, or deliberate mistakes. For example, from a data analysis perspective, NY, New York, and N.Y. are treated as three different states even though they refer to the same one. Values need to be consistent in order to provide accurate query results. One solution is to use the address standardization software offered by the United States Postal Service.
2. Decay. Data values can be accurate when entered but become inaccurate over time. This is very common for information about people, such as addresses, home phone numbers, marital status, and number of dependents.
3. Data movement. Another important source of inaccurate data is errors made in extracting, manipulating, transforming, and loading data into the internal database (for example, SQL Server or Access databases). Sometimes these processes are built without thorough knowledge of the quirks of the data sources. It is very difficult, almost impossible, to predict what the next dataset the marketer receives will look like, especially if it has been modified by a data entry person.
4. Data use. Very often, data becomes inaccurate when it is retrieved from the database and incorporated into reports, spreadsheets, query results, or portal documents. The marketer must understand the true semantics of the data and interpret results correctly in order to generate highly accurate reports.
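The inconsistency problem in item 1 (NY vs. New York vs. N.Y.) can be reduced with a normalization pass before loading. The sketch below is illustrative only: it is not the USPS software mentioned above, and the alias table is a hypothetical, far-from-exhaustive example.

```python
# Minimal sketch: normalize free-text state values to two-letter codes
# before loading them into the database. The alias table is illustrative
# and would need to be expanded for real data.
STATE_ALIASES = {
    "NY": "NY",
    "N.Y.": "NY",
    "NEW YORK": "NY",
    "CA": "CA",
    "CALIF.": "CA",
    "CALIFORNIA": "CA",
}

def normalize_state(raw: str) -> str:
    """Return a canonical state code, or the cleaned input if unknown."""
    key = raw.strip().upper()
    return STATE_ALIASES.get(key, key)

print(normalize_state(" n.y. "))    # NY
print(normalize_state("New York"))  # NY
```

Running such a step at load time means queries group all three spellings under one consistent value.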
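The attachment-to-database workflow described earlier (email attachment → folder → Access/SQL database) can also be sketched without a packaged tool such as Email2DB. This example assumes the report has already been saved to a drop folder as a CSV file and loads it into a SQLite staging table using only the standard library; the folder name, table name, and column names are hypothetical.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical drop folder where the email automation saves attachments.
INBOX_DIR = Path("reports_inbox")

def load_report(conn: sqlite3.Connection, csv_path: Path) -> int:
    """Load one impression/click CSV into a staging table; return row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS impressions_staging "
        "(campaign TEXT, impressions INTEGER, clicks INTEGER)"
    )
    with csv_path.open(newline="") as f:
        rows = [
            (r["campaign"], int(r["impressions"]), int(r["clicks"]))
            for r in csv.DictReader(f)
        ]
    conn.executemany("INSERT INTO impressions_staging VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Loading into a staging table first, rather than straight into the master database, makes it possible to validate row counts and value ranges before the data is merged, which mirrors the "process before placing into the master database" recommendation above.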
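Part of the pixel-monitoring recommendation can be automated with a crude static check: confirm that the tracking pixel tag is still present in the fetched HTML of each tagged page. This is only a sketch; the pixel URL is hypothetical, and a static check does not confirm the request actually fires in a browser, so it complements rather than replaces a full audit.

```python
import re

# Hypothetical pixel URL fragment the monitor looks for on each tagged page.
PIXEL_PATTERN = re.compile(
    r'<img[^>]+src="[^"]*track\.example\.com/pixel[^"]*"', re.IGNORECASE
)

def pixel_present(html: str) -> bool:
    """Return True if the tracking pixel tag appears in the page HTML."""
    return bool(PIXEL_PATTERN.search(html))
```

Run against every landing page, checkout step, and unsubscribe form on a schedule, a check like this surfaces missing tags quickly instead of after a long period of data loss.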