Generates Frequent Itemsets and Association Rules from transactional data.
This node uses the embedded R engine to identify frequent groupings of items (itemsets) within the transactional data and finds relationships between items to generate association rules. The node uses the Apriori data mining algorithm to identify the frequent itemsets.
Association rules are of the form:
{left hand side (LHS) items} ==> {right hand side (RHS) items}
where the rule says that if the LHS items are present in a transaction then the RHS items are also likely to be present. For example, in a retail setting, a customer who buys milk and bread (the LHS) is also likely to buy butter (the RHS).
Transactional data are supported in two formats:
- Basket, where a record contains a delimited list of the items in the transaction in a single field.
- Single, where a record contains a transaction identifier field and a field containing the item identifier.
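The two formats carry the same information. The following Python sketch illustrates this by parsing each format into a list of item sets. It is an illustration only, not the node's implementation; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def parse_basket(lines, separator=","):
    """Basket format: each line is one transaction, a delimited list of items."""
    return [
        frozenset(item.strip() for item in line.split(separator) if item.strip())
        for line in lines
        if line.strip()
    ]

def parse_single(records):
    """Single format: each record is a (transaction_id, item) pair."""
    grouped = defaultdict(set)
    for transaction_id, item in records:
        grouped[transaction_id].add(item)
    # Dictionaries keep insertion order, so transactions appear in first-seen order.
    return [frozenset(items) for items in grouped.values()]

# Hypothetical sample data describing the same three transactions in both formats.
basket_lines = ["milk,bread,butter", "milk,bread", "bread,butter"]
single_records = [
    (1, "milk"), (1, "bread"), (1, "butter"),
    (2, "milk"), (2, "bread"),
    (3, "bread"), (3, "butter"),
]

basket_transactions = parse_basket(basket_lines)
single_transactions = parse_single(single_records)
```

Both calls produce the same list of three item sets, which is why either format can feed the same analysis.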
A minimum Support level can be set. Support is the proportion of transactions that contain the items in a particular itemset.
A minimum Confidence level can be set for an association rule. Confidence is the conditional probability of one itemset being in a transaction given the presence of another (antecedent) itemset, i.e. the probability of finding the RHS itemset of the rule in the transactions under the condition that these transactions also contain the LHS itemset.
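As an illustration of these two definitions, the following Python sketch computes Support and Confidence for the rule {milk, bread} ==> {butter} on a small, made-up set of transactions. The function names and data are assumptions chosen for the example and are not part of the node.

```python
# Hypothetical transactions, each represented as a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Probability of finding the RHS in transactions that contain the LHS."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

rule_support = support({"milk", "bread", "butter"}, transactions)
rule_confidence = confidence({"milk", "bread"}, {"butter"}, transactions)
```

Here two of the five transactions contain all three items, so the Support is 0.4. Three transactions contain milk and bread, and two of those also contain butter, so the Confidence is 2/3.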
An antecedent itemset can be specified to restrict the rules generated by the node to those containing the specified items. The minimum size and maximum size of an itemset can also be specified.
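To show how a minimum Support level and itemset size limits shape the result, here is a minimal Python sketch of level-wise Apriori frequent itemset generation. It is a simplified teaching version under stated assumptions, not the code run by the embedded R engine; the function name, parameter names and sample data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support=0.1, min_size=1, max_size=10):
    """Return {itemset: support} for itemsets meeting min_support, built level by level."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those that reach the minimum support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: count / n for c, count in counts.items() if count / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([item]) for item in items})
    result = {}
    size = 1
    while level and size <= max_size:
        if size >= min_size:
            result.update(level)
        # Join step: combine frequent itemsets of this size into candidates one item larger.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size + 1}
        # Prune step: every subset of the current size must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size))
        }
        level = frequent(candidates)
        size += 1
    return result

# Hypothetical sample data.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
all_itemsets = apriori(transactions, min_support=0.5)
pairs_only = apriori(transactions, min_support=0.5, min_size=2, max_size=2)
```

With a minimum Support of 0.5, all seven non-empty combinations of the three items qualify. Restricting both size limits to 2 leaves only the three pairs.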
The association rule model can optionally be saved to a file.
When run, the node provides a summary of the rules and details of each rule.
The Summary pin contains summary statistics for the rules and includes information on:
- The number of rules generated.
- The distribution of the length of the rules in terms of the total number of items in the rule (in both the LHS and RHS of the rule).
- Minimum, maximum and quartile statistics.
- Quality statistics for Support, Confidence and Lift.
- The number of transactions analyzed.
- The Support and Confidence values used when deriving the rules.
If the serialized model is saved to a file the Summary pin also includes the path to the file containing the serialized model.
The Results pin contains details of the association rules and includes information on:
- A list of the items in the left hand side (antecedent) itemset.
- A list of the items in the right hand side (consequent) itemset.
- Support value for the rule.
- Confidence value for the rule.
- Lift value for the rule.
Lift is the ratio of the observed support of the rule to the support expected if the LHS and RHS itemsets were independent. It indicates how unlikely the association is to be a coincidence: a Lift value of 1 implies that the LHS and RHS occur together no more often than expected by chance, and a value greater than 1 indicates a positive association.
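The Lift calculation can be sketched in Python as the rule's support divided by the product of the LHS and RHS supports. The function names and the transactions below are made up for illustration and are not part of the node.

```python
# Hypothetical transactions, each represented as a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(lhs, rhs, transactions):
    """Observed support of the rule divided by the support expected under independence."""
    observed = support(set(lhs) | set(rhs), transactions)
    expected = support(lhs, transactions) * support(rhs, transactions)
    return observed / expected

rule_lift = lift({"milk", "bread"}, {"butter"}, transactions)
```

In this data the rule's support is 0.4, while the LHS and RHS each have a support of 0.6, so the Lift is 0.4 / 0.36, or about 1.11.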
If the data comes from a file and TransactionFormat is Basket, the file must not contain a header record and must contain only one column, which holds the item identifiers.
Powered by TIBCO®
Properties
ModelName
Optionally specify the name of a model which is displayed on the output data. When the node is configured to write the serialized model to a file, the model name is also used as the output filename.
A model name must start with a letter and may contain any of the following:
- letters
- numbers
- period character (".")
- underscore ("_")
If not specified, a default model name is displayed on the output data.
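The naming rule above can be checked with a regular expression. The Python sketch below assumes that "letters" means the ASCII letters A-Z and a-z, which the documentation does not state; the function name is hypothetical.

```python
import re

# Assumption: "letters" means ASCII letters only.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._]*")

def is_valid_model_name(name):
    """True if the name starts with a letter and then uses only letters, digits, '.' or '_'."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None

valid_example = is_valid_model_name("basket_rules.v1")
starts_with_digit = is_valid_model_name("1rules")
contains_space = is_valid_model_name("my model")
```

The first name is accepted, while the other two are rejected because one starts with a digit and the other contains a space.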
File
Specify the absolute filepath to the file containing the transactions.
This property is optional if the source of the transaction data is an input pin. The property must be specified when there is no input pin.
ItemField
Specify the field containing the Item Identifiers.
A value is required for this property when the data is coming from an input pin or when the data is from a File and TransactionFormat is Single.
If the data is from a File and TransactionFormat is Basket this property must be left empty.
TransactionFormat
Optionally specify the format of the transactions. Choose from:
- Basket - Each line in the transaction data represents a transaction where the items (item labels) are separated by the character specified by the TransactionSeparator property.
- Single - Each line corresponds to a single item and contains at least the transaction identifier and the item identifier.
The default value is Basket.
TransactionSeparator
Optionally specify the character used to separate items in a transaction when the transaction data are in 'basket' format. The default value is ",".
TransactionIdField
Optionally specify the field containing the Transaction Identifier. If the TransactionFormat property is set to Single the field containing the Transaction Identifier must be specified.
MinimumSupportLevel
Optionally specify the minimum level of support for an association rule. Must be a positive numeric value - decimal or integer.
The maximum value allowed in this property is 1.0. The default value is 0.1.
MinimumConfidenceLevel
Optionally specify the minimum level of confidence for an association rule. Must be a positive numeric value - decimal or integer. The maximum value allowed in this property is 1.0. The default value is 0.5.
ModelOutputMode
Optionally specify whether the serialized model is written to a file on disk.
This property also determines how ModelOutputField and ModelOutputDirectory behave. The default value is None.
ModelOutputField
Optionally specify a name for the output field that contains the full path of the file where the serialized model has been written. The default value is "mb_ModelOutput".
ModelOutputDirectory
Specify the directory where the serialized model is written when ModelOutputMode is set to File. When ModelOutputDirectory is blank, files are written to the Data360 Analyze temporary directory. Otherwise, the files are written to the specified directory, which must exist and be writable. This node will not overwrite existing files by default. This behavior can be set in the ExceptionBehavior tab (see the FileExistsBehavior property).
This property should only be filled in when ModelOutputMode is set to File.
AssociationRuleLHS
Optionally specify the items that are to appear in the left-hand-side of the rules (antecedents). Where multiple items are to appear in the LHS itemset, the items should be delimited by the separator specified in the TransactionSeparator property.
If not specified, all items are considered for inclusion in frequent itemsets.
MinimumItemsetSize
Optionally specify the minimum number of items that are included in an itemset. Must be an integer value. The default value is 1.
MaximumItemsetSize
Optionally specify the maximum number of items that are included in an itemset. Must be an integer value. If configured, this has the effect of limiting the size of an itemset. The default value is 10.
FileExistsBehavior
Optionally specify whether an existing serialized model file will be overwritten. Choose from:
- Error - Generate an error and do not overwrite the file.
- Log - Log a warning message and do not overwrite the file.
- Ignore - Do not overwrite the file.
- Overwrite - Overwrite the file.
The default value is Error.
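The four options can be read as a decision about whether to write the file. The Python sketch below models that decision under the assumption that each option behaves exactly as listed; it is not the node's code, and the function name and file names are hypothetical.

```python
import logging
import os
import tempfile

def should_write(path, behavior="Error"):
    """Return True if the model file may be written, following the four listed options."""
    if not os.path.exists(path):
        return True  # Nothing to overwrite, so the option does not apply.
    if behavior == "Overwrite":
        return True
    if behavior == "Error":
        raise FileExistsError(f"Model file already exists: {path}")
    if behavior == "Log":
        logging.warning("Model file already exists, not overwriting: %s", path)
        return False
    if behavior == "Ignore":
        return False
    raise ValueError(f"Unknown behavior: {behavior}")

with tempfile.TemporaryDirectory() as directory:
    existing = os.path.join(directory, "model.bin")  # hypothetical file name
    open(existing, "w").close()
    ignore_result = should_write(existing, "Ignore")
    overwrite_result = should_write(existing, "Overwrite")
    new_file_result = should_write(os.path.join(directory, "other.bin"), "Error")
```

Only Overwrite allows an existing file to be replaced. Error, Log and Ignore all leave the file untouched and differ only in how the situation is reported.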
Inputs and outputs
Inputs: 1 optional.
Outputs: Summary, Results.