Add Data Source - trillium_discovery - 17.1

Trillium Discovery Center

Product type: Software
Portfolio: Verify
Product family: Trillium
Product: Trillium > Trillium Discovery
Version: 17.1
Language: English
Product name: Trillium Discovery
Title: Trillium Discovery Center
Topic type: Overview, Administration, Configuration, Installation, Reference, How Do I
First publish date: 2008

The final step in adding a data source is to select data load properties. The load properties you specify determine the type of data source added and how your data is copied (or linked) to the current repository.

For example, you may want to create a profiled (fully-loaded) data source that includes only a sample of the records, so that you can run business rules and view metadata on a subset of your data. This can be a more efficient way to determine your data quality standards before you load the entire data set.

A dynamic data source is ideal for quickly examining very large external data sources, but can also be useful when analyzing the results from data samples to validate that your rules and standards meet your data quality requirements.

Note: If you are creating a dynamic data source, you will not have access to metadata such as key and dependency analysis until you load the data into a repository. With a dynamic data source, if the external source changes, the data displayed in the Discovery Center also changes.
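The difference between a copied (profiled) and a linked (dynamic) data source can be pictured with a small toy model. This is only an illustrative sketch, not Discovery Center code: the class names `ProfiledSource` and `DynamicSource` and the in-memory list standing in for the external source are all assumptions made for the example.

```python
class ProfiledSource:
    """Toy model of a profiled source: rows are copied once, at load time."""

    def __init__(self, external_rows):
        # Take a snapshot; later changes to the external source are not seen.
        self.rows = list(external_rows)


class DynamicSource:
    """Toy model of a dynamic source: rows are read from the external source on access."""

    def __init__(self, external_rows):
        # Keep only a reference (a "link") to the external source.
        self._external = external_rows

    @property
    def rows(self):
        # Re-read the external source every time the rows are requested.
        return list(self._external)


# Hypothetical external source with two rows.
external = ["row 1", "row 2"]
profiled = ProfiledSource(external)
dynamic = DynamicSource(external)

# The external source changes after both data sources were added.
external.append("row 3")

print(len(profiled.rows))  # 2: the snapshot is unchanged
print(len(dynamic.rows))   # 3: the linked view follows the external source
```

In this model the profiled source has to be loaded again to pick up the new row, while the dynamic source reflects it immediately.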

To add a data source to the repository

  1. In the Add Data Source window, click the Add tab.
  2. The Task name field is populated with the name of the data source file. Keep this default, or change the name. You can edit this name at any time after you add the data source.
  3. In the Add as section, select one of the following data source types: Profiled Data Source (the default) or Dynamic Data Source.
    Note: HDFS delimited data sources are added as profiled data sources only.
  4. Select the following data load options:
    Note: For HDFS data sources, loading sample data rows or skipping data rows is not supported. All data rows will be loaded by default.
    Option: Load all rows
    Description: Load all of the data rows.
    Note: Not available when adding dynamic data sources.

    Option: Load sample rows
    Description:
    • First [number] rows - Load a selected number of records from the beginning of the file. The default is 1000.
    • Random [percent] % of sample - Randomly sample a percentage of records from the file. The default is 20. Valid values are 0.01 to 99.99. The percentage is the chance that each row is included. Therefore, the actual number of rows loaded may be different for each load of the same file, even if you specify the same percentage.
    Note: Not available when adding dynamic data sources.

    Option: Skip first [number] rows
    Description: Skips the specified number of rows at the beginning of the file, which sets the starting row for the data import. The default is 1000.
    For example, if your file has 300 rows and you select Load all rows and Skip first 99 rows, the Discovery Center loads 201 rows, starting with the 100th row.

  5. For profiled data sources, select Schedule job options:
    • Select Now (the default) to schedule the job to run when you click Finish.
    • Select Later to schedule the job to run at a later time. The date and time fields default to the current date and the next half hour. To change the date, click the calendar icon and select a starting date. To change the time, click the clock icon and select the time you want the job to run on the selected date.
  6. Click Summary to review all selections and to make edits as needed. Click Back or In Progress to return to the Add tab.
  7. Do one of the following:
    • To add the data source, click Finish. The Add tab content is replaced with a message confirming the data source will be added to the repository.
    • To close the window without saving your work, click Cancel or the X icon.
  8. Do one of the following:
    • Click Add Another Data Source to continue adding data sources.
    • Click Done to close the window.
  9. (Profiled data sources only) View the details of the add data source job in the Task Manager.
    Note: No background task is logged in the Task Manager when you create a dynamic data source, because only a limited analysis is run.
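The data load options in step 4 can be sketched in a few lines of Python. This is a minimal illustration of the behavior described above, not the product's implementation: the function `select_rows` and its parameters are hypothetical, and it assumes that skipped rows are removed before any sampling is applied, which the steps above do not state.

```python
import random


def select_rows(rows, skip_first=0, first_n=None, random_percent=None, seed=None):
    """Return the rows that would be loaded under the given (hypothetical) options.

    skip_first      -- number of rows to skip at the start of the file
    first_n         -- load only this many rows after skipping ("First [number] rows")
    random_percent  -- chance, in percent, that each row is included
    seed            -- optional seed, used here only to make a run repeatable
    """
    if first_n is not None and random_percent is not None:
        raise ValueError("choose either first_n or random_percent, not both")
    if random_percent is not None and not (0.01 <= random_percent <= 99.99):
        raise ValueError("random_percent must be between 0.01 and 99.99")

    # Assumption: skipping is applied first, then sampling.
    remaining = list(rows)[skip_first:]

    if first_n is not None:
        return remaining[:first_n]

    if random_percent is not None:
        rng = random.Random(seed)
        # Each row is kept independently with the given percentage chance,
        # so the number of rows kept varies from one load to the next.
        return [row for row in remaining if rng.random() * 100 < random_percent]

    # "Load all rows": everything after the skipped rows.
    return remaining


# The example from step 4: a 300-row file, Load all rows, Skip first 99 rows.
rows = list(range(1, 301))
loaded = select_rows(rows, skip_first=99)
print(len(loaded), loaded[0])  # 201 100

# Two random 20% loads of the same file can return different row counts.
print(len(select_rows(rows, random_percent=20)))
print(len(select_rows(rows, random_percent=20)))
```

Because each row is tested independently, a 20% sample of 300 rows averages about 60 rows but is rarely exactly 60, which is why repeated loads of the same file can differ.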