Examines database data to determine its data type and statistical composition, and outputs a detailed JSON description that can be used for further analysis.
JDBC_RESULT_QUERY_FORMAT
option in the DbUrl or DbOptions property. Note that this is a breaking change if you have configured this node in Data360 Analyze prior to the 3.16.0 release. If you have existing nodes configured to connect to Snowflake using our shipped driver, where the DbUrl property has been set explicitly or if you have not set the
DbType property, you must add the JDBC_RESULT_QUERY_FORMAT
option to prevent the node from failing. For more information on data types, see Data types.
The output analysis from the node is provided in three formats:
- The analysis output pin provides data profile information which you can use to analyze the profile metrics of the database, and consists of a row for each field that the node has profiled, with the properties of each field contained in separate columns.
- The detail output pin provides a JSON description containing the field properties that you can use for further analysis, and consists of a row for each field that the node has profiled.
- The errors output pin provides details of any analysis errors that may take place. See the FailureBehavior property for more details.
Configuring the Database Profiler node
To configure the Database Profiler node and connect to an external database:
- Configure the required properties to connect to your database. If you are connecting to
a database that uses a driver which is pre-installed with Data360 Analyze, follow the Simple database
connection steps. Or, if you are connecting to a database that uses a driver
which is not shipped with Data360 Analyze, see Advanced database connection.Tip: For a list of the drivers which are included with your installation, see Acquiring data from a database.
- If you are working with a large database, it is recommended that you filter your queries as appropriate, using the options under Properties > Configure > Filtering in the right-hand panel.
Logical Type detection
You can add your own Logical types - also known as semantic types - to those detected by default by the Database Profiler node. To do this, provide a JSON specification in a file that you upload to the server, and use the path to that file as the value of the LogicalTypeDefinitions property on the node.
For example, to detect possible UK National Insurance number values in a data field, you could use the following JSON:
[
{
"semanticType": "UK.NI",
"pluginType": "regex",
"validLocales": [
{
"localeTag": "*",
"matchEntries" : [{ "regExpReturned": "\\s*[a-zA-Z]{2}(\\s*\\d\\s*){6}[a-zA-Z]\\s*" } ]
}],
"threshold": 98,
"baseType": "STRING"
}
]
When you supply input data to the Database Profiler node, types can be identified by using the JSON when the following criteria are met:
- If a string field contains values that are matched by the regular expression defined in
the
regExpReturned
property in the JSON, -
AND, if the percentage of values for a field that are identified as the specified
type is equal to or greater than the value specified by
threshold
. The value is given as a percentage, so in this example, a minimum of 98% of the values must be detected as a match. - When a match is found, the output analysis identifies the Type Qualifier for the
field as the value given by the
semanticType
property - in this caseUK.NI
- and the value of theValidation
column for the field is the regular expression defined in theregExp
property in the JSON.
For more information, see https://github.com/tsegall/fta
Supported Database
The following two databases are supported for the node,
- MS SQL Server
- Oracle
The following list of databases are also configured to work with the node, however performance may vary:
- DB2
- MySQL / MariaDB
- Postgres
- Redshift
- Snowflake
- Teradata
- Google Big Query
- Hana
Properties
DbUser
Optionally specify the database username to connect to the database.
DbPassword
Optionally specify the password used to connect to the database.
DbType
Select a database type to connect to. If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.
DbName
Specify the name of the database to connect to.
If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.
DbHost
Specify the database host to connect to e.g. localhost
. Leave this property blank if connecting to MS Access.
If you are connecting to a database that uses a driver which is not pre-installed with Data360 Analyze, then this property is not required. In this case, specify the connection details in the Advanced property group.
DbUrl
Optionally specify the driver's database URL. If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required.
Otherwise, specify the URL in the following format: jdbc:subprotocol:subname
. For example, jdbc:sqlserver://localhost:1433;databaseName=CustomerData
. Provided by the driver's vendor.
See the help topic Acquiring data from a database for more information on configuring this property.
DbDriver
Optionally specify the class name of your JDBC driver.
If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required. Otherwise, specify the class name of your JDBC driver, as provided by the driver's vendor.
See the help topic Acquiring data from a database for more information on configuring this property.
DbOptions
Optionally specify database connection properties e.g. key=val&key=val2
or key=value
entry per line.
DbDriverClasspath
Optionally specify the database driver classpath.
If a database is selected from the DbType property and a DbName and DbHost are specified, then this property is not required. Otherwise, specify the classpath with each classpath entry on a new line.
Each classpath entry must be the location on the file system of either a single jar file, or a directory containing jars or class files.
DbDriverNativeLibraryDirectories
List of directories to search in for native libraries e.g. DLLs and shared libraries (.so).
The directories will be used by the JVM search for any native libraries needed by the node or any of it dependencies. More specifically, the paths will be searched for any call to System.load(...)
and System.loadLibrary(...)
.
Additionally, the paths will be used when searching for the pre-load native libraries specified by the DbDriverNativeLibraries property.
The directories must be absolute paths, or resolve to absolute paths via substitution. The set of allowable directories is also limited by the server side property ls.brain.node.jdbc.dbDriverNativeLibraryWhitelistPaths
.
Any directory that does not satisfy these conditions will be ignored. Set the LogLevel property to 0
to trace library resolution.
DbDriverNativeLibraries
A list of names, paths, or patterns of native libraries to pre-load into the JVM prior to loading the driver. The specified entries will be searched for in the directories listed in the DbDriverNativeLibraryDirectories property.
Note that this will only load the libraries in the JVM memory space, and will not register them for JNI binding. It is assumed that the driver itself will make the appropriate calls to System.load()
and System.loadLibrary()
.
The primary use of this property is to load transitive dependencies for native libraries that would have normally be required to be specified via an environment variable e.g. PATH
or LD_LIBRARY_PATH
. The values can be in the following forms:
- A library name which will be matched using the platforms specific naming library naming scheme, e.g. for Linux, "foo" would match "libfoo.so", for windows, it would match "foo.dll".
- An absolute path.
- A glob pattern of the form:
glob:
- A regex pattern of the form:
regex:
For both glob and regex patterns, the pattern will be matched against the entire absolute path of the library files.
ProfileMode
Choose from:
- List Only - List the Tables and Columns discovered.
- Count Only - List the Tables and Columns discovered along with the number of rows per table.
- Profile - Execute a profile on all Tables & Columns discovered.
The default value is List Only.
CatalogFilter
Optionally specify a list (comma separated) of names to constrain the Catalogs retrieved.
- For the CatalogFilter property, you must type the name of a
Catalog and an exact match is performed. For the other filter properties, you can type
either a name, or you can type a string followed by the
%
character. - You can either use CatalogFilter or ExcludedCatalogs, but not both.
SchemaFilter
Optionally specify a list (comma separated) of filters to constrain the Schemas retrieved. When using a filter, "%" means match any substring of 0 or more characters, and "_" means match any one character.
ExcludedCatalogs
Optionally specify a list (comma separated) of catalogs that will be ignored.
ExcludedSchemas
Optionally specify a list (comma separated) of schemas that will be ignored.
ExcludedTablesViews
Optionally specify a list (comma separated) of tables or views that will be ignored.
TableViewFilter
Optionally specify a list (comma separated) of filters to constrain the Tables/Views retrieved. When using a filter, "%" means match any substring of 0 or more characters, and "_" means match any one character.
ExcludeSystemCatalogs
Optionally specify whether to exclude the system predefined Catalogs or not. The default value is True.
ExcludeSystemSchemas
Optionally specify whether to exclude the system predefined Schemas or not. The default value is True.
DefaultToUnicode
Optionally specify whether string columns should default to Unicode type. If set to False, string columns default to string type. The default value is True.
UnicodeBinding
Optionally specify how Unicode strings should be passed to the JDBC driver. Choose from:
- CHARACTER - Bind the string using the mechanism for CHAR and VARCHAR types.
- NATIONAL CHARACTER - Bind the string using the mechanism for NCHAR and NVARCHAR types.
SqlDateType
Optionally select a datetype in which to map a SQL DATE. Choose from:
- date
- datetime
Sample Size
Optionally specify the size of the sample set of the input data to analyze. The default is to process 100k rows.
Max Profile Size
If there are less than MaxProfileSize records in a table being profiled, then the reported null count will be the total number of nulls in the table as reported by the database. If, however, there are more than MaxProfileSize records in the table, then the null count will always be 0. Previously the sampling done by the node could include null values which could skew the profiling results and the null count was based on the number of nulls in the sample. Now, the sampling will not include null values therefore if no overall query on the table can be done due to the MaxProfileSize setting, the null count will always be zero.
Distinct Approximation Size
Used to control the threshold for using an approximation to calculate the number of distinct values.
Any table with a size smaller than this threshold will use COUNT(DISTINCT()) any table larger than this will use an approximation if supported by the database. The default value is 1,000,000, set to -1 to disable.
Maximum Cardinality
Set the maximum cardinality that will be tracked. This value represents the number of discrete values for a field that will be tracked in order to analyze a field. The default value is 12000.
JSONDetail Level
Control the level of detail related to valid, invalid, and shape sets provided in the JSON output.
- 0 - just counts.
- 1 - first 100 rows.
- 2 - all rows.
The default value is 1.
Locale
Optionally specify the Locale to be used for analyzing the input stream.
Locale must be specified in the format <language code>[_<country code>[_<variant code>]]
For example, to set the locale to English, the Locale property should be set to en.
For US English, the Locale property should be set to en_US.
For Austrian German, the Locale property should be set to de_AT.
The default value is the default Locale under which the server is running.
Enable Default Logical Types
Enable the default set of Logical Types, which is based on the locale. The default value is True.
LogicalTypeDefinitions
Optionally provide the path to a JSON specification for additional Logical Types to be detected. For details, see Logical Type detection.
Choose the (from Field) variant of this property to look up the value from an input field with the name specified.
FailureBehavior
Optionally specify what to do when an error is detected. Choose from:
- Error - Report error and stop further processing.
- Log - Log a warning message.
- Ignore - Ignore the error.
NativeLibraryDirectories
List of directories to search in for native libraries, e.g. DLLs and shared libraries (.so).
The directories will be used by the JVM search for any native libraries needed by the node
or any of its dependencies. More specifically, the paths will be searched for any call to
System.load(...)
and System.loadLibrary(...)
.
Additionally, the paths will be used when searching for the pre-load native libraries
specified by the NativeLibraryPreloads property.
The directories must be absolute paths, or resolve to absolute paths via substitution. The
set of allowable directories is also limited by the server side property
ls.brain.node.java.nativeLibraryWhitelistPaths
. Any directory that does
not satisfy these conditions will be ignored. Set the LogLevel
property to 0
to trace library resolution.
NativeLibraryPreloads
A list of names, paths, or patterns of native libraries to pre-load into the JVM prior to running the node. The specified entries will be searched for in the directories listed in the NativeLibraryDirectories property.
System.load()
and
System.loadLibrary()
. The primary use of this property is to load transitive dependencies for native libraries that would have normally be required to be specified via an environment variable, e.g. PATH
or LD_LIBRARY_PATH
. The values can be in the following forms:
- A library name which will be matched using the platforms specific naming library naming scheme, e.g. for Linux, "foo" would match "libfoo.so", for windows, it would match "foo.dll".
- An absolute path
- A glob pattern of the form:
glob:
- A regex pattern of the form:
regex:
For both glob and regex patterns, the pattern will be matched against the entire absolute path of the library files.
Inputs and outputs
Inputs: 1 optional.
Outputs: analysis, details, error.