
Financial Transaction Anomaly Detection and Data Lineage with Delta Lake


Introduction

In the financial industry, detecting anomalies in transaction data is critical for fraud prevention, compliance, and risk management. Unusual transaction amounts, irregular transaction types, and atypical patterns can indicate fraudulent activities or errors that require immediate attention. Leveraging big data technologies and advanced data processing frameworks can enhance the ability to detect such anomalies effectively.

[Screenshot: Superset Ilum SQL transaction dashboard]

This example presents a simplified use case that demonstrates how to perform financial transaction anomaly detection using Spark SQL and Delta Lake. The solution also integrates with the OpenLineage framework to capture detailed data lineage information throughout the data processing pipeline.

info

The same result can be achieved with Iceberg or Hudi. We chose Delta because it is the default table format in Ilum.

The BI dashboard is built with Apache Superset, one of the modules available in Ilum. You can find the dashboard for this use case here.


Dataset Description

The dataset used in this use case consists of simulated financial transaction records with the following columns:

  • TransactionTimestamp: Timestamp of when the transaction occurred.
  • TransactionID: Unique identifier for each transaction.
  • AccountID: Unique identifier for the bank account involved.
  • Amount: Monetary value of the transaction.
  • Merchant: The merchant or entity involved in the transaction.
  • TransactionType: Type of transaction (e.g., Purchase, Withdrawal, Transfer).
  • Location: Geographical location where the transaction took place.

Sample Data:

| TransactionTimestamp | TransactionID | AccountID | Amount | Merchant | TransactionType | Location |
| --- | --- | --- | --- | --- | --- | --- |
| 2023-01-01 08:00:00 | TXN1127 | ACC4 | 95071.92 | MerchantH | Purchase | Tokyo |
| 2023-01-01 08:01:00 | TXN1639 | ACC10 | 15607.89 | MerchantH | Purchase | London |
| 2023-01-01 08:02:00 | TXN872 | ACC8 | 65092.34 | MerchantE | Withdrawal | London |
| 2023-01-01 08:03:00 | TXN2001 | ACC4 | 500.50 | MerchantA | Transfer | New York |
| 2023-01-01 08:04:00 | TXN2002 | ACC10 | 1200.00 | MerchantB | Purchase | Paris |
| 2023-01-01 08:05:00 | TXN2003 | ACC8 | 300.00 | MerchantC | Withdrawal | Tokyo |

Objectives

The primary objectives of this use case are:

  • Anomaly Detection: Identify transactions that are anomalies based on the amount exceeding an upper limit derived from the data distribution.
  • Irregular Transaction Identification: Detect irregular transaction types for specific accounts.
  • Pattern Recognition: Recognize unusual patterns based on transaction timestamps and locations.
  • Behavioral Analysis: Track spending behaviors based on merchants.
  • Statistical Summaries: Compute total transactions, average, minimum, and maximum transaction amounts per account.
  • Delta Lake Integration: Utilize Delta Lake features such as time travel, change data feed, and schema evolution to enhance data processing.
  • Data Lineage Tracking: Capture and visualize data lineage using the OpenLineage framework.

Solution Overview

The solution involves creating a series of Delta Lake tables to process and analyze the transaction data. The steps include:

  1. Data Ingestion: Load the transaction data into a Delta Lake table.
  2. Statistical Analysis: Calculate statistical parameters to define anomaly thresholds.
  3. Anomaly Detection: Identify transactions and accounts that exceed the defined limits.
  4. Pattern Analysis: Detect irregular transaction types and temporal patterns.
  5. Behavioral Analysis: Analyze spending behaviors based on merchants.
  6. Data Enhancement: Use Delta Lake features for data updates, schema evolution, and time travel.
  7. Reporting: Generate summary reports for business insights.
  8. Data Lineage: Leverage OpenLineage to capture data lineage throughout the pipeline.

Technical Implementation

Step 1: Create the Transactions Delta Table

We start by creating the transactions table as a Delta Lake table to store the financial transaction data.

SQL Code:

CREATE TABLE transactions (
  TransactionTimestamp TIMESTAMP COMMENT 'Date and time when the transaction occurred',
  TransactionID STRING COMMENT 'Unique identifier for the transaction',
  AccountID STRING COMMENT 'Unique identifier for the account involved',
  Amount DECIMAL(18, 2) COMMENT 'Monetary value of the transaction',
  Merchant STRING COMMENT 'Merchant involved in the transaction',
  TransactionType STRING COMMENT 'Type of the transaction (e.g., Purchase, Withdrawal)',
  Location STRING COMMENT 'Location where the transaction occurred'
)
USING DELTA
COMMENT 'Table containing financial transactions';

Insert Sample Data:

info

A link to the example data file can be found here.

Upload the file to the storage linked to your Ilum instance before running the statement below.

INSERT OVERWRITE TABLE transactions
SELECT
  _c0, _c1, _c2, _c3, _c4, _c5, _c6
FROM csv.`s3a://ilum-data/financial_anomaly_data.csv`;

Explanation:

  • USING DELTA: Specifies that the table is a Delta Lake table.
  • COMMENT clauses: Provide metadata for better documentation and understanding.
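
Note that when Spark reads a raw CSV through the csv.`...` path syntax, each positional column (_c0 through _c6) arrives as a string, so the INSERT above relies on implicit casting into the table's column types. A variant with explicit casts, shown below as a sketch, makes the conversions to TIMESTAMP and DECIMAL visible:

SQL Code:

-- Sketch: explicit casts for the string columns read from the CSV file
INSERT OVERWRITE TABLE transactions
SELECT
  CAST(_c0 AS TIMESTAMP),
  _c1,
  _c2,
  CAST(_c3 AS DECIMAL(18, 2)),
  _c4,
  _c5,
  _c6
FROM csv.`s3a://ilum-data/financial_anomaly_data.csv`;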

Sample Output:

[Screenshot: Superset Ilum SQL transaction dashboard]

Sample Visualization:

[Screenshot: Superset Ilum SQL transaction dashboard]

Step 2: Detecting Anomalies in Transaction Amounts

We aim to identify transactions where the amount exceeds an upper limit based on the distribution of the Amount column.

2.1 Calculate Statistical Parameters

SQL Code:

CREATE VIEW amount_stats AS
SELECT
  AVG(Amount) AS avg_amount,
  STDDEV(Amount) AS stddev_amount,
  PERCENTILE(Amount, 0.95) AS percentile_95
FROM transactions;

Explanation:

  • AVG(Amount): Calculates the average transaction amount.
  • STDDEV(Amount): Calculates the standard deviation, useful for understanding data dispersion.
  • PERCENTILE(Amount, 0.95): Computes the 95th percentile, used as the upper limit for normal transactions.

Sample Output:

[Screenshot: Ilum SQL results]
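
The 95th percentile is one possible cutoff. If you prefer a dispersion-based rule instead, a mean-plus-three-standard-deviations limit can be derived from the same view; the following is a sketch (the view name is invented), not part of the original pipeline:

SQL Code:

-- Sketch: an alternative "three sigma" upper limit built on the same statistics
CREATE VIEW amount_limit_sigma AS
SELECT
  avg_amount + 3 * stddev_amount AS upper_limit
FROM amount_stats;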

2.2 Identify Transactions Exceeding the Limit

SQL Code:

CREATE VIEW anomalies AS
SELECT
  TransactionID,
  AccountID,
  Amount,
  CASE
    WHEN Amount > (SELECT percentile_95 FROM amount_stats) THEN 'ExceedsLimit'
    ELSE 'WithinLimit'
  END AS AnomalyStatus
FROM transactions;

Explanation:

  • AnomalyStatus: A new column indicating whether the transaction exceeds the limit.

Sample Output:

| TransactionID | AccountID | Amount | AnomalyStatus |
| --- | --- | --- | --- |
| TXN1127 | ACC4 | 95071.92 | ExceedsLimit |
| ... | ... | ... | ... |

[Screenshot: Ilum SQL results]

Sample Visualization:

[Screenshot: Ilum SQL results]
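
To quantify how the data splits across the two statuses, the view can be aggregated directly; a usage sketch:

SQL Code:

-- Sketch: count transactions per anomaly status
SELECT
  AnomalyStatus,
  COUNT(*) AS TransactionCount
FROM anomalies
GROUP BY AnomalyStatus;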

Step 3: Calculating Account Statistics

Compute total transactions, average, minimum, and maximum transaction amounts per account.

SQL Code:

CREATE TABLE account_totals
USING DELTA
AS
SELECT
  AccountID,
  COUNT(TransactionID) AS TransactionCount,
  SUM(Amount) AS TotalAmount,
  AVG(Amount) AS AverageAmount,
  MIN(Amount) AS MinAmount,
  MAX(Amount) AS MaxAmount
FROM transactions
GROUP BY AccountID;

Explanation:

  • GROUP BY AccountID: Aggregates data per account.
  • Aggregate Functions: Compute statistical measures for each account.

Sample Output:

| AccountID | TransactionCount | TotalAmount | AverageAmount | MinAmount | MaxAmount |
| --- | --- | --- | --- | --- | --- |
| ACC4 | 2 | 95572.42 | 47786.21 | 500.50 | 95071.92 |
| ... | ... | ... | ... | ... | ... |

[Screenshot: Ilum SQL results]
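
A quick way to surface the heaviest accounts from this table is to sort by total volume; a usage sketch:

SQL Code:

-- Sketch: top 10 accounts by total transaction amount
SELECT
  AccountID,
  TransactionCount,
  TotalAmount
FROM account_totals
ORDER BY TotalAmount DESC
LIMIT 10;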

Step 4: Identifying Accounts Exceeding Limits

Identify accounts that have transactions exceeding the predefined limit.

4.1 Determine Exceeding Accounts

SQL Code:

CREATE VIEW exceeding_accounts AS
SELECT DISTINCT AccountID
FROM anomalies
WHERE AnomalyStatus = 'ExceedsLimit';

4.2 Count Exceeding and Non-Exceeding Accounts

SQL Code:

SELECT
  (SELECT COUNT(*) FROM exceeding_accounts) AS ExceedingAccounts,
  (SELECT COUNT(DISTINCT AccountID) FROM transactions)
    - (SELECT COUNT(*) FROM exceeding_accounts) AS NonExceedingAccounts;

Sample Output:

| ExceedingAccounts | NonExceedingAccounts |
| --- | --- |
| 1 | 23 |

[Screenshot: Ilum SQL results]

Step 5: Recognizing Unusual Temporal Patterns

Detect transactions occurring outside of normal business hours (e.g., 9 AM to 5 PM).

SQL Code:

CREATE VIEW after_hours_transactions AS
SELECT
  *,
  HOUR(TransactionTimestamp) AS TransactionHour
FROM transactions
WHERE HOUR(TransactionTimestamp) NOT BETWEEN 9 AND 17;

Explanation:

  • HOUR(TransactionTimestamp): Extracts the hour from the timestamp.
  • NOT BETWEEN 9 AND 17: Filters transactions outside business hours.
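
Aggregating the view by hour shows how after-hours activity is distributed, which helps distinguish a one-off late transaction from a recurring pattern; a usage sketch:

SQL Code:

-- Sketch: distribution of after-hours transactions by hour of day
SELECT
  TransactionHour,
  COUNT(*) AS TransactionCount
FROM after_hours_transactions
GROUP BY TransactionHour
ORDER BY TransactionHour;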

Step 6: Analyzing Spending Behaviors by Merchant

Aggregate spending data to understand customer behaviors based on merchant interactions.

SQL Code:

CREATE TABLE merchant_spending
USING DELTA
AS
SELECT
  Merchant,
  COUNT(*) AS TransactionCount,
  SUM(Amount) AS TotalSpent,
  AVG(Amount) AS AverageTransactionAmount,
  MIN(Amount) AS MinAmount,
  MAX(Amount) AS MaxAmount
FROM transactions
GROUP BY Merchant;

Sample Output:

| Merchant | TransactionCount | TotalSpent | AverageTransactionAmount | MinAmount | MaxAmount |
| --- | --- | --- | --- | --- | --- |
| MerchantH | 2 | 110679.81 | 55339.91 | 15607.89 | 95071.92 |
| ... | ... | ... | ... | ... | ... |

[Screenshot: Ilum SQL results]

Sample Visualization:

[Screenshot: Ilum SQL results]

Step 7: Leveraging Delta Lake Features

7.1 Enable Change Data Feed

Enable Delta Lake's change data feed to track changes in the transactions table.

SQL Code:

ALTER TABLE transactions
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
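
You can verify that the property took effect by listing the table properties:

SQL Code:

-- Confirm that delta.enableChangeDataFeed is set on the table
SHOW TBLPROPERTIES transactions;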

7.2 Perform Updates and Track Changes

Update transaction amounts due to corrections and track these changes.

SQL Code:

-- Update a transaction
UPDATE transactions
SET Amount = 1300.00
WHERE TransactionID = 'TXN1127';

-- Check the latest version
DESCRIBE HISTORY transactions;

[Screenshot: Ilum SQL results]

-- Query changes using the change data feed, starting from the appropriate table version
SELECT *
FROM table_changes('transactions', 3)
WHERE TransactionID = 'TXN1127';

[Screenshot: Ilum SQL results]
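
The change data feed also attaches metadata columns (_change_type, _commit_version, _commit_timestamp) describing each change; for an UPDATE, both the pre-image and post-image rows are returned. A sketch that selects them explicitly:

SQL Code:

-- Sketch: inspect the CDF metadata columns for the corrected transaction
SELECT
  TransactionID,
  Amount,
  _change_type,       -- e.g., update_preimage / update_postimage
  _commit_version,
  _commit_timestamp
FROM table_changes('transactions', 3)
WHERE TransactionID = 'TXN1127';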

7.3 Schema Evolution: Add New Column

Add a new column FraudFlag to flag suspected fraudulent transactions.

SQL Code:

ALTER TABLE transactions
ADD COLUMNS (FraudFlag BOOLEAN COMMENT 'Flag indicating suspected fraud');

7.4 Update Fraudulent Transactions

Flag transactions exceeding the upper limit.

SQL Code:

UPDATE transactions
SET FraudFlag = TRUE
WHERE Amount > 95071.92; -- literal threshold for illustration; in practice, reuse the computed limit

[Screenshot: Ilum SQL results]

7.5 Time Travel and Table Restore

View Table History

SQL Code:

DESCRIBE HISTORY transactions;

[Screenshot: Ilum SQL results]

Query a Previous Version

SQL Code:

SELECT * FROM transactions VERSION AS OF 3;

Restore Table to a Previous Version

SQL Code:

RESTORE TABLE transactions TO VERSION AS OF 3;

Explanation:

  • Time Travel: Access previous versions of the data for auditing or recovery.
  • Table Restore: Revert the table to a known good state after unintended changes.
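
Delta Lake also supports timestamp-based time travel; the sketch below assumes you substitute a timestamp taken from the DESCRIBE HISTORY output:

SQL Code:

-- Sketch: query the table as of a point in time (use a timestamp from DESCRIBE HISTORY)
SELECT * FROM transactions TIMESTAMP AS OF '2024-01-01 00:00:00';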

Step 8: Generating Final Summary Reports

Generate reports summarizing accounts exceeding limits and compute averages.

SQL Code:

CREATE TABLE account_exceeding_summary
USING DELTA
AS
SELECT
  a.AccountID,
  a.TransactionCount,
  a.TotalAmount,
  a.AverageAmount,
  a.MinAmount,
  a.MaxAmount,
  CASE
    WHEN e.AccountID IS NOT NULL THEN 'ExceedsLimit'
    ELSE 'WithinLimit'
  END AS ExceedingStatus
FROM account_totals a
LEFT JOIN exceeding_accounts e
  ON a.AccountID = e.AccountID;

[Screenshot: Ilum SQL results]

-- Compute averages for exceeding and non-exceeding accounts
SELECT
  ExceedingStatus,
  COUNT(AccountID) AS AccountCount,
  AVG(TotalAmount) AS AverageTotalAmount,
  AVG(AverageAmount) AS AverageTransactionAmount
FROM account_exceeding_summary
GROUP BY ExceedingStatus;

[Screenshot: Ilum SQL results]

Delta Lake Features Utilized

  • Delta Tables: All tables are created using Delta Lake to leverage ACID transactions and schema enforcement.
  • Time Travel: Access previous versions of data for auditing and recovery.
  • Table Restore: Restore tables to a previous state if necessary.
  • Change Data Feed: Track changes to data for incremental processing and auditing.
  • Schema Evolution: Add new columns (FraudFlag) without disrupting existing data.
  • Constraints and Checks: Enforce data integrity through schema definitions.
  • Partitioning: Although not explicitly used here, Delta Lake supports partitioning for performance optimization; see the sketch after this list.
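
For illustration, a partitioned variant of the transactions table could be declared as follows; this is a hypothetical sketch (the table name is invented), not part of the pipeline above:

SQL Code:

-- Hypothetical sketch: a Delta table partitioned by transaction type
CREATE TABLE transactions_partitioned (
  TransactionTimestamp TIMESTAMP,
  TransactionID STRING,
  AccountID STRING,
  Amount DECIMAL(18, 2),
  Merchant STRING,
  Location STRING,
  TransactionType STRING
)
USING DELTA
PARTITIONED BY (TransactionType);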

Data Lineage and OpenLineage Integration

By executing these SQL operations in an environment integrated with OpenLineage, each operation generates lineage metadata; a configuration sketch for wiring up the integration follows the list below. This metadata captures:

  • Data Flow: How data moves from one table to another (e.g., from transactions to anomalies).
  • Transformations: The specific transformations applied to the data (e.g., adding AnomalyStatus).
  • Schema Changes: Modifications to the table schema (e.g., adding FraudFlag column).
  • Versioning: Changes in table versions due to updates and restores.
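
How lineage events are emitted depends on the deployment. In a plain Spark setup, the OpenLineage Spark integration is typically enabled through session configuration such as the sketch below; the endpoint URL and namespace are placeholder assumptions, and Ilum may preconfigure the listener for you:

# Sketch: enabling the OpenLineage Spark listener (placeholder endpoint and namespace)
spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type=http
spark.openlineage.transport.url=http://openlineage-endpoint:5000
spark.openlineage.namespace=financial-transactions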

Benefits:

  • Transparency: Clear understanding of how data is processed and transformed.
  • Auditing: Ability to trace data lineage for compliance and auditing purposes.
  • Debugging: Easier identification of issues in data processing pipelines.
  • Optimization: Insights into data flow can help optimize performance.

[Screenshot: Ilum data lineage]

Conclusion

This use case demonstrates how to perform financial anomaly detection using Spark SQL and Delta Lake, incorporating key features like time travel, change data feed, and schema evolution. By integrating with OpenLineage, the solution provides comprehensive data lineage tracking, enhancing transparency and compliance.

Key Takeaways:

  • Effective Anomaly Detection: Utilizing statistical methods to identify transactions and accounts that deviate from normal patterns.
  • Robust Data Management: Leveraging Delta Lake features to ensure data integrity, facilitate schema changes, and enable recovery options.
  • Data Lineage Visibility: Capturing detailed lineage information to support auditing and compliance requirements.
  • Business Insights: Generating reports and analyses that provide valuable insights into customer behaviors and potential risks.

Appendix

Dataset Schema

transactions Table Schema:

| Column | Data Type | Description |
| --- | --- | --- |
| TransactionTimestamp | TIMESTAMP | Date and time when the transaction occurred |
| TransactionID | STRING | Unique identifier for the transaction |
| AccountID | STRING | Unique identifier for the account involved |
| Amount | DECIMAL(18,2) | Monetary value of the transaction |
| Merchant | STRING | Merchant involved in the transaction |
| TransactionType | STRING | Type of the transaction (e.g., Purchase, Withdrawal) |
| Location | STRING | Location where the transaction occurred |
| FraudFlag | BOOLEAN | Flag indicating suspected fraud (added later) |
