Database Profiler - Data360_Analyze - Latest

Data360 Analyze Server Help

Product type
Software
Portfolio
Verify
Product family
Data360
Product
Data360 Analyze
Version
Latest
Language
English
Product name
Data360 Analyze
Title
Data360 Analyze Server Help
Copyright
2024
First publish date
2016
Last updated
2024-11-28
Published on
2024-11-28T15:26:57.181000

Examines database data to determine its data type and statistical composition, and outputs a detailed JSON description that can be used for further analysis.

Note: If you specify a DbType of Snowflake and do not specify a value in the DbUrl property, the node will be automatically configured to communicate with the Snowflake database using the "JDBC_RESULT_QUERY_FORMAT" set to JSON. When forming connections to a Snowflake database using the shipped driver, if you do not specify a DbType or if the DbUrl is set explicitly, you will need to manually configure the node to use the JDBC_RESULT_QUERY_FORMAT option in the DbUrl or DbOptions property. Note that this is a breaking change if you have configured this node in Data360 Analyze prior to the 3.16.0 release. If you have existing nodes configured to connect to Snowflake using our shipped driver, where the DbUrl property has been set explicitly or if you have not set the DbType property, you must add the JDBC_RESULT_QUERY_FORMAT option to prevent the node from failing.

For more information on data types, see Data types.

The output analysis from the node is provided in three formats:

  • The analysis output pin provides data profile information which you can use to analyze the profile metrics of the database, and consists of a row for each field that the node has profiled, with the properties of each field contained in separate columns.
  • The detail output pin provides a JSON description containing the field properties that you can use for further analysis, and consists of a row for each field that the node has profiled.
  • The errors output pin provides details of any analysis errors that may take place. See the FailureBehavior property for more details.

Configuring the Database Profiler node

To configure the Database Profiler node and connect to an external database:

  1. Configure the required properties to connect to your database. If you are connecting to a database that uses a driver which is pre-installed with Data360 Analyze, follow the Simple database connection steps. Or, if you are connecting to a database that uses a driver which is not shipped with Data360 Analyze, see Advanced database connection.
    Tip: For a list of the drivers which are included with your installation, see Acquiring data from a database.
  2. If you are working with a large database, it is recommended that you filter your queries as appropriate, using the options under Properties > Configure > Filtering in the right-hand panel.

Logical Type detection

You can add your own Logical types - also known as semantic types - to those detected by default by the Database Profiler node. To do this, provide a JSON specification in a file that you upload to the server, and use the path to that file as the value of the LogicalTypeDefinitions property on the node.

For example, to detect possible UK National Insurance number values in a data field, you could use the following JSON:


        [
	{
		"semanticType": "UK.NI",
		"pluginType": "regex",
		"validLocales": [ 
		{ 
			"localeTag":  "*", 
			"matchEntries" : [{ "regExpReturned": "\\s*[a-zA-Z]{2}(\\s*\\d\\s*){6}[a-zA-Z]\\s*" } ] 
		}],
		"threshold": 98,
		"baseType": "STRING"
	}
]
      

When you supply input data to the Database Profiler node, types can be identified by using the JSON when the following criteria are met:

  • If a string field contains values that are matched by the regular expression defined in the regExpReturned property in the JSON,
  • AND, if the percentage of values for a field that are identified as the specified type is equal to or greater than the value specified by threshold. The value is given as a percentage, so in this example, a minimum of 98% of the values must be detected as a match.
  • When a match is found, the output analysis identifies the Type Qualifier for the field as the value given by the semanticType property - in this case UK.NI - and the value of the Validation column for the field is the regular expression defined in the regExp property in the JSON.

For more information, see https://github.com/tsegall/fta

Supported Database

The following two databases are supported for the node,

  • MS SQL Server
  • Oracle

The following list of databases are also configured to work with the node, however performance may vary:

  • DB2
  • MySQL / MariaDB
  • Postgres
  • Redshift
  • Snowflake
  • Teradata
  • Google Big Query
  • Hana

Properties

DbUser

Optionally specify the database username to connect to the database.

DbPassword

Optionally specify the password used to connect to the database.

DbType

Select a database type to connect to. If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.

DbName

Specify the name of the database to connect to.

Note: If you're connecting to MS Access, enter the file path to your MSAccess database. If you're connecting to an Oracle database, enter the system identifier (SID).

If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.

DbHost

Specify the database host to connect to e.g. localhost. Leave this property blank if connecting to MS Access.

If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.

DbUrl

Optionally specify the driver's database URL. If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required.

Otherwise, specify the URL in the following format: jdbc:subprotocol:subname. For example, jdbc:sqlserver://localhost:1433;databaseName=CustomerData. Provided by the driver's vendor.

See the help topic Acquiring data from a database for more information on configuring this property.

DbDriver

Optionally specify the class name of your JDBC driver.

If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required. Otherwise, specify the class name of your JDBC driver, as provided by the driver's vendor.

See the help topic Acquiring data from a database for more information on configuring this property.

DbOptions

Optionally specify database connection properties e.g. key=val&key=val2 or key=value entry per line.

DbDriverClasspath

Optionally specify the database driver classpath.

If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required. Otherwise, specify the classpath with each classpath entry on a new line.

Each classpath entry must be the location on the file system of either a single jar file, or a directory containing jars or class files.

DbDriverNativeLibraryDirectories

List of directories to search in for native libraries e.g. DLLs and shared libraries (.so).

The directories will be used by the JVM search for any native libraries needed by the node or any of it dependencies. More specifically, the paths will be searched for any call to System.load(...) and System.loadLibrary(...).

Additionally, the paths will be used when searching for the pre-load native libraries specified by the DbDriverNativeLibraries property.

The directories must be absolute paths, or resolve to absolute paths via substitution. The set of allowable directories is also limited by the server side property ls.brain.node.jdbc.dbDriverNativeLibraryWhitelistPaths.

Any directory that does not satisfy these conditions will be ignored. Set the LogLevel property to 0 to trace library resolution.

DbDriverNativeLibraries

A list of names, paths, or patterns of native libraries to pre-load into the JVM prior to loading the driver. The specified entries will be searched for in the directories listed in the DbDriverNativeLibraryDirectories property.

Note that this will only load the libraries in the JVM memory space, and will not register them for JNI binding. It is assumed that the driver itself will make the appropriate calls to System.load() and System.loadLibrary().

The primary use of this property is to load transitive dependencies for native libraries that would have normally be required to be specified via an environment variable e.g. PATH or LD_LIBRARY_PATH. The values can be in the following forms:

  • A library name which will be matched using the platforms specific naming library naming scheme, e.g. for Linux, "foo" would match "libfoo.so", for windows, it would match "foo.dll".
  • An absolute path.
  • A glob pattern of the form: glob:
  • A regex pattern of the form: regex:

For both glob and regex patterns, the pattern will be matched against the entire absolute path of the library files.

ProfileMode

Choose from:

  • List Only - List the Tables and Columns discovered.
  • Count Only - List the Tables and Columns discovered along with the number of rows per table.
  • Profile - Execute a profile on all Tables & Columns discovered.

The default value is List Only.

CatalogFilter

Optionally specify a list (comma separated) of names to constrain the Catalogs retrieved.

Note:
  • For the CatalogFilter property, you must type the name of a Catalog and an exact match is performed. For the other filter properties, you can type either a name, or you can type a string followed by the % character.
  • You can either use CatalogFilter or ExcludedCatalogs, but not both.

SchemaFilter

Optionally specify a list (comma separated) of filters to constrain the Schemas retrieved. When using a filter, "%" means match any substring of 0 or more characters, and "_" means match any one character.

Note: You can either use SchemaFilter or ExcludedSchemas, but not both.

ExcludedCatalogs

Optionally specify a list (comma separated) of catalogs that will be ignored.

Note: You can either use CatalogFilter or ExcludedCatalogs, but not both.

ExcludedSchemas

Optionally specify a list (comma separated) of schemas that will be ignored.

Note: You can either use SchemaFilter or ExcludedSchemas, but not both.

ExcludedTablesViews

Optionally specify a list (comma separated) of tables or views that will be ignored.

Note: You can either use TableViewFilter or ExcludedTablesViews - but not both.

TableViewFilter

Optionally specify a list (comma separated) of filters to constrain the Tables/Views retrieved. When using a filter, "%" means match any substring of 0 or more characters, and "_" means match any one character.

ExcludeSystemCatalogs

Optionally specify whether to exclude the system predefined Catalogs or not. The default value is True.

ExcludeSystemSchemas

Optionally specify whether to exclude the system predefined Schemas or not. The default value is True.

DefaultToUnicode

Optionally specify whether string columns should default to Unicode type. If set to False, string columns default to string type. The default value is True.

UnicodeBinding

Optionally specify how Unicode strings should be passed to the JDBC driver. Choose from:

  • CHARACTER - Bind the string using the mechanism for CHAR and VARCHAR types.
  • NATIONAL CHARACTER - Bind the string using the mechanism for NCHAR and NVARCHAR types.

SqlDateType

Optionally select a datetype in which to map a SQL DATE. Choose from:

  • date
  • datetime

Sample Size

Optionally specify the size of the sample set of the input data to analyze. The default is to process 100k rows.

Max Profile Size

If there are less than MaxProfileSize records in a table being profiled, then the reported null count will be the total number of nulls in the table as reported by the database. If, however, there are more than MaxProfileSize records in the table, then the null count will always be 0. Previously the sampling done by the node could include null values which could skew the profiling results and the null count was based on the number of nulls in the sample. Now, the sampling will not include null values therefore if no overall query on the table can be done due to the MaxProfileSize setting, the null count will always be zero.

Distinct Approximation Size

Used to control the threshold for using an approximation to calculate the number of distinct values.

Any table with a size smaller than this threshold will use COUNT(DISTINCT()) any table larger than this will use an approximation if supported by the database. The default value is 1,000,000, set to -1 to disable.

Maximum Cardinality

Set the maximum cardinality that will be tracked. This value represents the number of discrete values for a field that will be tracked in order to analyze a field. The default value is 12000.

JSONDetail Level

Control the level of detail related to valid, invalid, and shape sets provided in the JSON output.

  • 0 - just counts.
  • 1 - first 100 rows.
  • 2 - all rows.

The default value is 1.

Locale

Optionally specify the Locale to be used for analyzing the input stream.

Locale must be specified in the format <language code>[_<country code>[_<variant code>]]

For example, to set the locale to English, the Locale property should be set to en.

For US English, the Locale property should be set to en_US.

For Austrian German, the Locale property should be set to de_AT.

The default value is the default Locale under which the server is running.

Enable Default Logical Types

Enable the default set of Logical Types, which is based on the locale. The default value is True.

LogicalTypeDefinitions

Optionally provide the path to a JSON specification for additional Logical Types to be detected. For details, see Logical Type detection.

Choose the (from Field) variant of this property to look up the value from an input field with the name specified.

FailureBehavior

Optionally specify what to do when an error is detected. Choose from:

  • Error - Report error and stop further processing.
  • Log - Log a warning message.
  • Ignore - Ignore the error.

NativeLibraryDirectories

List of directories to search in for native libraries, e.g. DLLs and shared libraries (.so).

The directories will be used by the JVM search for any native libraries needed by the node or any of its dependencies. More specifically, the paths will be searched for any call to System.load(...) and System.loadLibrary(...). Additionally, the paths will be used when searching for the pre-load native libraries specified by the NativeLibraryPreloads property.

The directories must be absolute paths, or resolve to absolute paths via substitution. The set of allowable directories is also limited by the server side property ls.brain.node.java.nativeLibraryWhitelistPaths. Any directory that does not satisfy these conditions will be ignored. Set the LogLevel property to 0 to trace library resolution.

NativeLibraryPreloads

A list of names, paths, or patterns of native libraries to pre-load into the JVM prior to running the node. The specified entries will be searched for in the directories listed in the NativeLibraryDirectories property.

Note: The pre-load will only load the libraries in the JVM memory space, and will not register them for JNI binding. It is assumed that the code itself will make the appropriate calls to System.load() and System.loadLibrary().

The primary use of this property is to load transitive dependencies for native libraries that would have normally be required to be specified via an environment variable, e.g. PATH or LD_LIBRARY_PATH. The values can be in the following forms:

  • A library name which will be matched using the platforms specific naming library naming scheme, e.g. for Linux, "foo" would match "libfoo.so", for windows, it would match "foo.dll".
  • An absolute path
  • A glob pattern of the form: glob:
  • A regex pattern of the form: regex:

For both glob and regex patterns, the pattern will be matched against the entire absolute path of the library files.

Inputs and outputs

Inputs: 1 optional.

Outputs: analysis, details, error.