I have an opportunity at Honda for Knowledge Catalog in their back office operations team. This team is charged with ad hoc reporting from a number of on-premises data sources (e.g. Oracle, SQL Server, Excel spreadsheets, Lotus databases). These reports require data classification in addition to other basic ETL-like functions. As part of the classification step, the team needs to search specific columns of data (e.g. vehicle feature) for specific attributes. Unfortunately, each division and organization in Honda uses slightly different terms for the same thing (e.g. 4 wheel drive is referred to as 4WD, 4-wd, awd, AWD, etc.). Often they need to classify this data before it can be processed further by tools like Watson Studio and Watson Analytics.
We would like a classifier ETL function added to WKC that lets the user supply a csv or xls file containing all the synonyms and their corresponding classifier. The synonyms need to allow wildcards, so a regex-like syntax would be of great value. In addition, the synonyms need to support multiple languages; in the example included, they are English and Japanese.
To improve the process, there should be a setting that uses Thesaurus.com to enrich the synonyms without having to list them all. For example, if I turn on Thesaurus.com for my classifier, then all the synonyms of each synonym entered in the csv file would be used in addition to the synonym specified: U.S. would be automatically extended to include U.S.A., United States of America, etc. If any of the additional synonyms from Thesaurus.com are found in the data set, they too are assigned the specified classifier.
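To make the request concrete, here is a minimal sketch in Python of the kind of synonym-file classifier described above. It is not WKC code: the file layout (a `pattern` column and a `classifier` column), the sample patterns, and the function names `load_rules` and `classify` are all assumptions made for illustration. Each pattern is treated as a case-insensitive regular expression, and the first matching row wins.

```python
import csv
import io
import re

# Hypothetical synonym file: one regex pattern per row, mapped to a classifier.
# The column names and patterns are illustrative assumptions only.
SYNONYM_CSV = r"""pattern,classifier
4[\s-]?wd,Four-Wheel Drive
\bawd\b,Four-Wheel Drive
四輪駆動,Four-Wheel Drive
"""


def load_rules(csv_text):
    """Parse the synonym file into (compiled regex, classifier) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (re.compile(row["pattern"], re.IGNORECASE), row["classifier"])
        for row in reader
    ]


def classify(value, rules):
    """Return the classifier of the first pattern found in value, else None."""
    for pattern, classifier in rules:
        if pattern.search(value):
            return classifier
    return None


rules = load_rules(SYNONYM_CSV)

# Classify a sample "vehicle feature" column with mixed spellings and languages.
column = ["4WD", "4-wd", "AWD", "四輪駆動車", "FWD"]
labels = [classify(cell, rules) for cell in column]
```

In this sketch the unmatched value (`FWD`) is left unclassified as `None`. A Thesaurus.com option would add extra patterns to `rules` before classification; that lookup is left out here because it depends on an external service.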
As a nice-to-have, it might be valuable to offer a second classifier algorithm based on Watson Natural Language Classifier, where the user imports the xls/csv as before but the file is loaded into NLC behind the scenes and the NLC machine learning model performs the classification. However, this NLC option is secondary to the primary need for the basic text and wildcard token matching outlined above.
I have demonstrated WKC to both Cisco and Fluor, and they also have use cases requiring this same functionality.
Why is it useful?
|Who would benefit from this IDEA?||Data Stewards, Business Analysts and Data Scientists|
How should it work?
|Submitting Organization||F2F Sales|