IBM Cloud Databases - Structured Ideas

This Ideas portal is being closed. Please enter new ideas at http://ibm.biz/IBMAnalyticsIdeasPortal

Provide Meta-Data and Meta-Analysis Catalog Management

Provide a meta-data and meta-analysis catalog, with associated management and collaboration capabilities, to enable the description of data elements with respect to both inherent structure (e.g. domain, range of numeric fields) and relative structure (e.g. text fields representing enumerated set members: dog, cat, other). The catalog should also support consumption, modification, and version control across semantic release levels (n.b. see https://semver.org/).
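To make the versioning part concrete, here is a minimal sketch of what a catalog entry for an enumerated type might look like. The names `EnumEntity` and `revise` are hypothetical, and the rule used (adding members is a minor release, removing members is a major release) is only one assumed reading of how semantic versioning could apply to an enumerated set.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class EnumEntity:
    """Hypothetical catalog entry describing an enumerated type."""
    name: str
    members: FrozenSet[str]
    version: Tuple[int, int, int] = (1, 0, 0)  # (major, minor, patch)


def revise(entity: EnumEntity, new_members: FrozenSet[str]) -> EnumEntity:
    """Return a new entry whose version reflects how the member set changed.

    Assumed rule: only additions -> minor bump (existing data stays valid);
    any removal -> major bump (existing data may become invalid).
    """
    major, minor, _patch = entity.version
    if new_members == entity.members:
        return entity
    if entity.members <= new_members:
        version = (major, minor + 1, 0)
    else:
        version = (major + 1, 0, 0)
    return replace(entity, members=new_members, version=version)


color = EnumEntity("color", frozenset({"red", "green", "blue"}))
color_v2 = revise(color, frozenset({"red", "green", "blue", "yellow"}))
color_v3 = revise(color_v2, frozenset({"red", "green"}))
print(color_v2.version, color_v3.version)
```

Keeping entries immutable means every revision is a new object, which fits the provenance tracking asked for later in this idea.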

A simple use case is a table column containing text whose contents represent an enumerated set (e.g. red, green, blue). The UX would include the following major steps:

PART A: Building the Catalog

1) Select a column displayed in the user interface (e.g. "Color (Type: String)") and inspect the range of "String" (e.g. count, unique, most common, least common, unspecified); see csvkit's csvstat(1) command for example output

2) Find a defined entity in a faceted catalog (e.g. "Color.Pantone.XXX") or create a new entity (e.g. "color") that represents the enumerated type; n.b. matching range (e.g. { red, green, blue, .. })

3) Assign the selected enumerated type to the column (n.b. this enables additional type-specific functions in the future)

3a) Identify aberrant data and cleanse

3b) Optimize encoding of enumerated set, e.g. bit-wise encoded (n.b. lazy evaluation)

4) Track provenance, control versions appropriately, and repeat (w/ community, including open/shared entries in the catalog for common entities and third-party entries for industry-specific ones)
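The Part A steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`profile`, `aberrant`, `encode`) and the sample column are hypothetical, empty strings and None are treated as "unspecified", and the encoding is a plain integer code per member rather than any particular storage format.

```python
from collections import Counter


def profile(values):
    """Step 1: summarize a text column, loosely in the spirit of csvstat."""
    present = [v for v in values if v not in (None, "")]
    ranked = Counter(present).most_common()  # sorted by frequency, descending
    return {
        "count": len(values),
        "unique": len(ranked),
        "most_common": ranked[0][0] if ranked else None,
        "least_common": ranked[-1][0] if ranked else None,
        "unspecified": len(values) - len(present),
    }


def aberrant(values, members):
    """Step 3a: values that are present but not in the enumerated set."""
    return sorted({v for v in values if v not in (None, "") and v not in members})


def encode(values, members):
    """Step 3b: map each member to a small integer code.

    Returns the codes plus the number of bits one code needs.
    Unspecified or aberrant values are encoded as None.
    """
    codes = {m: i for i, m in enumerate(sorted(members))}
    bits = max(1, (len(members) - 1).bit_length())
    return [codes.get(v) for v in values], bits


column = ["red", "green", "red", "blue", "", "rde", "red", None, "green"]
color = {"red", "green", "blue"}  # step 2: the enumerated type from the catalog

stats = profile(column)
bad = aberrant(column, color)
encoded, bits = encode(column, color)
print(stats, bad, encoded, bits)
```

Here the misspelled "rde" shows up as aberrant data to cleanse, and three members fit in two bits per value.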

PART B: Using the Catalog

5) Quickly identify information of interest via full-text as well as faceted search (n.b. compare Amazon shopping)

6) Understand the provenance, semantics, domain, range, etc., and the availability of the information identified

7) Find and utilize information-associated methods and apparatus to access, transform, analyze/train, visualize, and inspect (e.g. Jupyter notebook requiring Parquet in COS of inputs { X, Y, Z } with X, Y, Z being defined in the Catalog or SystemML script).

8) Automatically generate plans to consume the available information and present them as either "cheap" or "quick" options (maybe with some intervals in between, depending on the plans available, ...)
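Steps 5 and 8 can be sketched in a similar spirit. Everything here is hypothetical: the catalog entries, the facet names, the `Plan` fields, and the cost and time figures are made up for illustration. The idea shown is a combined full-text and facet filter, followed by keeping only the plans that no other plan beats on both cost and time, so the cheapest and the quickest sit at the two ends with any in-between options in the middle.

```python
from dataclasses import dataclass


def search(entries, text="", **facets):
    """Step 5: full-text match on the description plus exact facet matches."""
    text = text.lower()
    return [
        e for e in entries
        if text in e["description"].lower()
        and all(e["facets"].get(k) == v for k, v in facets.items())
    ]


@dataclass(frozen=True)
class Plan:
    name: str
    cost: float     # hypothetical monetary units
    minutes: float  # hypothetical elapsed time


def tradeoffs(plans):
    """Step 8: drop plans that another plan beats on cost and time alike."""
    def dominated(p):
        return any(
            q.cost <= p.cost and q.minutes <= p.minutes
            and (q.cost < p.cost or q.minutes < p.minutes)
            for q in plans
        )
    return sorted((p for p in plans if not dominated(p)), key=lambda p: p.cost)


catalog = [
    {"name": "color", "description": "Enumerated color of an item",
     "facets": {"type": "enum", "format": "parquet"}},
    {"name": "price", "description": "Item price in USD",
     "facets": {"type": "numeric", "format": "parquet"}},
    {"name": "shade", "description": "Free-text color shade",
     "facets": {"type": "string", "format": "csv"}},
]
hits = search(catalog, text="color", format="parquet")

plans = [
    Plan("scan CSV in place", 1.0, 90),
    Plan("convert to Parquet first", 3.0, 30),
    Plan("provision in-memory cluster", 10.0, 5),
    Plan("provision oversized cluster", 12.0, 6),
]
options = tradeoffs(plans)
cheap, quick = options[0], options[-1]
print([e["name"] for e in hits], cheap.name, quick.name)
```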

  • Guest
  • Dec 14 2018
  • Needs review
Why is it useful?
Who would benefit from this IDEA? As a data scientist, I need to know a lot about the data that is not captured in the catalog, and I need that information to be shared with colleagues, both internally and externally.
How should it work?
Idea Priority High
Priority Justification
Customer Name
Submitting Organization Other
Submitter Tags
  • Guest commented
    December 14, 2018 17:58

    There are a lot of parts in there, but the basics start with breaking down my Strings into enumerated sets; everything else just cascades from there.