Content and data sets are acquired from many sources:
- Public Domain data
- Licensed content
- Digital Exhaust
- Client data
- Companies' own internal data assets
Content engineers and data scientists run many different cleaning processes to prepare these data sets for analytics and model training experiments. This request has two parts:
Common Tooling and Scripts provided as part of a Data Refinery toolkit
As part of this preparation, content engineers and data scientists use many common tools and processes in their own environments. This request is to make tools such as virus and malware scanners, copyright scanners, PII (personally identifiable information) scanners, and PII pattern scanners available as part of the data transformation suite, so that they can be used in projects and workflows to clean data sets. With the vast array of data sources available today, it is critical that all content follows security and compliance best practices before it is added to a company's repositories, data lakes, and catalogs.
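To illustrate the kind of tool meant by a PII pattern scanner, here is a minimal Python sketch. It is not part of any existing toolkit: the names `PII_PATTERNS`, `scan_for_pii`, and `redact_pii` are hypothetical, and the two regular expressions are deliberately simplistic assumptions that a production scanner would need to extend considerably.

```python
import re

# Hypothetical, illustrative patterns only; real PII coverage is much broader.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_for_pii(text):
    """Return a list of (pattern_name, matched_text) for every hit in text."""
    findings = []
    for name, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((name, match.group()))
    return findings


def redact_pii(text, placeholder="[REDACTED]"):
    """Replace every match of every pattern with a placeholder string."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(placeholder, text)
    return text
```

A workflow step could call `scan_for_pii` on each record to flag data sets for review, or `redact_pii` to clean records before they are added to a repository.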
User Specific Cleaning and Data Preparation tools
In addition to these common tools and processes, every team has unique scripts and procedures specific to its data formatting and data preparation requirements. This second part of the request is to let users bring the custom scripts or Docker containers that they have developed and hardened over years and add them to projects and workflows. Making it easy to bring the tools and processes that engineers and data scientists already use in their environments allows faster, easier adoption.
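One possible shape for this capability, sketched in Python under stated assumptions: a registry where a team adds its own cleaning functions by name and then composes them into a workflow. `StepRegistry`, `collapse_whitespace`, and `strip_control_chars` are hypothetical names invented for this illustration; they do not describe an existing product API, and container-based steps are not covered here.

```python
from typing import Callable, Dict, List

# A cleaning step takes one text record and returns the cleaned record.
CleaningStep = Callable[[str], str]


class StepRegistry:
    """Hypothetical registry that holds user-supplied cleaning steps by name."""

    def __init__(self) -> None:
        self._steps: Dict[str, CleaningStep] = {}

    def register(self, name: str, step: CleaningStep) -> None:
        # Refuse silent overwrites so two teams cannot clash on a name.
        if name in self._steps:
            raise ValueError(f"step already registered: {name}")
        self._steps[name] = step

    def run(self, names: List[str], record: str) -> str:
        # Apply the named steps in the order given.
        for name in names:
            record = self._steps[name](record)
        return record


# Examples of team-specific scripts a user might bring.
def collapse_whitespace(record: str) -> str:
    return " ".join(record.split())


def strip_control_chars(record: str) -> str:
    return "".join(ch for ch in record if ch.isprintable() or ch in "\n\t")
```

The point of the sketch is that existing scripts only need to be wrapped as functions and registered once; after that, any project or workflow can refer to them by name.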