Financial Transaction Anomaly Detection and Data Lineage with Delta Lake
Introduction
In the financial industry, detecting anomalies in transaction data is critical for fraud prevention, compliance, and risk management. Unusual transaction amounts, irregular transaction types, and atypical patterns can indicate fraudulent activities or errors that require immediate attention. Leveraging big data technologies and advanced data processing frameworks can enhance the ability to detect such anomalies effectively.

This example presents a simplified use case that demonstrates how to perform financial transaction anomaly detection using Spark SQL and Delta Lake. The solution also integrates with the OpenLineage framework to capture detailed data lineage information throughout the data processing pipeline.
The same result can be achieved with Iceberg or Hudi; we chose Delta because it is the default table format in Ilum.
The BI dashboard is built with Apache Superset, one of the modules available in Ilum. You can find the dashboard for this use case here.
Dataset Description
The dataset used in this use case consists of simulated financial transaction records with the following columns:
- TransactionTimestamp: Timestamp of when the transaction occurred.
- TransactionID: Unique identifier for each transaction.
- AccountID: Unique identifier for the bank account involved.
- Amount: Monetary value of the transaction.
- Merchant: The merchant or entity involved in the transaction.
- TransactionType: Type of transaction (e.g., Purchase, Withdrawal, Transfer).
- Location: Geographical location where the transaction took place.
Sample Data:
TransactionTimestamp | TransactionID | AccountID | Amount | Merchant | TransactionType | Location |
---|---|---|---|---|---|---|
2023-01-01 08:00:00 | TXN1127 | ACC4 | 95071.92 | MerchantH | Purchase | Tokyo |
2023-01-01 08:01:00 | TXN1639 | ACC10 | 15607.89 | MerchantH | Purchase | London |
2023-01-01 08:02:00 | TXN872 | ACC8 | 65092.34 | MerchantE | Withdrawal | London |
2023-01-01 08:03:00 | TXN2001 | ACC4 | 500.50 | MerchantA | Transfer | New York |
2023-01-01 08:04:00 | TXN2002 | ACC10 | 1200.00 | MerchantB | Purchase | Paris |
2023-01-01 08:05:00 | TXN2003 | ACC8 | 300.00 | MerchantC | Withdrawal | Tokyo |
Objectives
The primary objectives of this use case are:
- Anomaly Detection: Identify transactions that are anomalies based on the amount exceeding an upper limit derived from the data distribution.
- Irregular Transaction Identification: Detect irregular transaction types for specific accounts.
- Pattern Recognition: Recognize unusual patterns based on transaction timestamps and locations.
- Behavioral Analysis: Track spending behaviors based on merchants.
- Statistical Summaries: Compute total transactions, average, minimum, and maximum transaction amounts per account.
- Delta Lake Integration: Utilize Delta Lake features such as time travel, change data feed, and schema evolution to enhance data processing.
- Data Lineage Tracking: Capture and visualize data lineage using the OpenLineage framework.
Solution Overview
The solution involves creating a series of Delta Lake tables to process and analyze the transaction data. The steps include:
- Data Ingestion: Load the transaction data into a Delta Lake table.
- Statistical Analysis: Calculate statistical parameters to define anomaly thresholds.
- Anomaly Detection: Identify transactions and accounts that exceed the defined limits.
- Pattern Analysis: Detect irregular transaction types and temporal patterns.
- Behavioral Analysis: Analyze spending behaviors based on merchants.
- Data Enhancement: Use Delta Lake features for data updates, schema evolution, and time travel.
- Reporting: Generate summary reports for business insights.
- Data Lineage: Leverage OpenLineage to capture data lineage throughout the pipeline.
Technical Implementation
Step 1: Create the Transactions Delta Table
We start by creating the transactions table as a Delta Lake table to store the financial transaction data.
SQL Code:
CREATE TABLE transactions (
TransactionTimestamp TIMESTAMP COMMENT 'Date and time when the transaction occurred',
TransactionID STRING COMMENT 'Unique identifier for the transaction',
AccountID STRING COMMENT 'Unique identifier for the account involved',
Amount DECIMAL(18, 2) COMMENT 'Monetary value of the transaction',
Merchant STRING COMMENT 'Merchant involved in the transaction',
TransactionType STRING COMMENT 'Type of the transaction (e.g., Purchase, Withdrawal)',
Location STRING COMMENT 'Location where the transaction occurred'
)
USING DELTA
COMMENT 'Table containing financial transactions';
Insert Sample Data:
There are two options, both shown below:
- Import the data from a file
- Insert the data manually
A link to the example file can be found here. The file must be uploaded to the storage linked to your Ilum instance.
INSERT OVERWRITE TABLE transactions
SELECT
_c0,_c1,_c2,_c3,_c4,_c5,_c6
FROM csv.`s3a://ilum-data/financial_anomaly_data.csv`;
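Reading the file with the csv. path syntax returns string columns named _c0 through _c6, so the insert above relies on implicit casts into the table's column types. If your Spark session runs in ANSI mode, or if the implicit casts fail, a variant with explicit casts (assuming the file has no header row) looks like this:
INSERT OVERWRITE TABLE transactions
SELECT
CAST(_c0 AS TIMESTAMP), -- TransactionTimestamp
_c1, -- TransactionID
_c2, -- AccountID
CAST(_c3 AS DECIMAL(18, 2)), -- Amount
_c4, -- Merchant
_c5, -- TransactionType
_c6 -- Location
FROM csv.`s3a://ilum-data/financial_anomaly_data.csv`;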
INSERT INTO transactions VALUES
(TIMESTAMP('2023-01-01 08:00:00'), 'TXN1127', 'ACC4', 95071.92, 'MerchantH', 'Purchase', 'Tokyo'),
(TIMESTAMP('2023-01-01 08:01:00'), 'TXN1639', 'ACC10', 15607.89, 'MerchantH', 'Purchase', 'London'),
(TIMESTAMP('2023-01-01 08:02:00'), 'TXN872', 'ACC8', 65092.34, 'MerchantE', 'Withdrawal', 'London'),
(TIMESTAMP('2023-01-01 08:03:00'), 'TXN2001', 'ACC4', 500.50, 'MerchantA', 'Transfer', 'New York'),
(TIMESTAMP('2023-01-01 08:04:00'), 'TXN2002', 'ACC10', 1200.00, 'MerchantB', 'Purchase', 'Paris'),
(TIMESTAMP('2023-01-01 08:05:00'), 'TXN2003', 'ACC8', 300.00, 'MerchantC', 'Withdrawal', 'Tokyo'),
(TIMESTAMP('2023-01-01 09:15:00'), 'TXN2004', 'ACC5', 25000.00, 'MerchantD', 'Deposit', 'Berlin'),
(TIMESTAMP('2023-01-01 10:30:00'), 'TXN2005', 'ACC6', 75000.00, 'MerchantE', 'Purchase', 'Sydney'),
(TIMESTAMP('2023-01-01 11:45:00'), 'TXN2006', 'ACC7', 500.00, 'MerchantF', 'Withdrawal', 'Toronto'),
(TIMESTAMP('2023-01-01 12:00:00'), 'TXN2007', 'ACC4', 300000.00, 'MerchantG', 'Transfer', 'Dubai'),
(TIMESTAMP('2023-01-01 13:15:00'), 'TXN2008', 'ACC8', 150.00, 'MerchantH', 'Purchase', 'Tokyo'),
(TIMESTAMP('2023-01-01 14:30:00'), 'TXN2009', 'ACC10', 2000.00, 'MerchantI', 'Deposit', 'New York'),
(TIMESTAMP('2023-01-01 15:45:00'), 'TXN2010', 'ACC5', 800.00, 'MerchantJ', 'Withdrawal', 'London'),
(TIMESTAMP('2023-01-01 16:00:00'), 'TXN2011', 'ACC6', 45000.00, 'MerchantK', 'Purchase', 'Paris'),
(TIMESTAMP('2023-01-01 17:15:00'), 'TXN2012', 'ACC7', 600.00, 'MerchantL', 'Transfer', 'Berlin'),
(TIMESTAMP('2023-01-01 18:30:00'), 'TXN2013', 'ACC4', 70000.00, 'MerchantM', 'Purchase', 'Tokyo'),
(TIMESTAMP('2023-01-01 19:45:00'), 'TXN2014', 'ACC8', 500000.00, 'MerchantN', 'Withdrawal', 'Sydney'),
(TIMESTAMP('2023-01-01 20:00:00'), 'TXN2015', 'ACC10', 120000.00, 'MerchantO', 'Purchase', 'Dubai'),
(TIMESTAMP('2023-01-01 21:15:00'), 'TXN2016', 'ACC5', 350.00, 'MerchantP', 'Deposit', 'Toronto'),
(TIMESTAMP('2023-01-01 22:30:00'), 'TXN2017', 'ACC6', 1000.00, 'MerchantQ', 'Withdrawal', 'New York'),
(TIMESTAMP('2023-01-01 23:45:00'), 'TXN2018', 'ACC7', 90000.00, 'MerchantR', 'Transfer', 'Paris'),
(TIMESTAMP('2023-01-02 00:00:00'), 'TXN2019', 'ACC4', 250.00, 'MerchantS', 'Purchase', 'London'),
(TIMESTAMP('2023-01-02 01:15:00'), 'TXN2020', 'ACC8', 400.00, 'MerchantT', 'Withdrawal', 'Tokyo'),
(TIMESTAMP('2023-01-02 02:30:00'), 'TXN2021', 'ACC10', 5000.00, 'MerchantU', 'Deposit', 'Berlin'),
(TIMESTAMP('2023-01-02 03:45:00'), 'TXN2022', 'ACC5', 600000.00, 'MerchantV', 'Purchase', 'Dubai'),
(TIMESTAMP('2023-01-02 04:00:00'), 'TXN2023', 'ACC6', 50.00, 'MerchantW', 'Transfer', 'Sydney'),
(TIMESTAMP('2023-01-02 05:15:00'), 'TXN2024', 'ACC7', 800.00, 'MerchantX', 'Purchase', 'New York'),
(TIMESTAMP('2023-01-02 06:30:00'), 'TXN2025', 'ACC4', 1000000.00, 'MerchantY', 'Withdrawal', 'Tokyo'),
(TIMESTAMP('2023-01-02 07:45:00'), 'TXN2026', 'ACC8', 100.00, 'MerchantZ', 'Deposit', 'London');
Explanation:
- USING DELTA: Specifies that the table is a Delta Lake table.
- COMMENTS: Provide metadata for better documentation and understanding.
Sample Output:

Sample Visualization:

Step 2: Detecting Anomalies in Transaction Amounts
We aim to identify transactions where the amount exceeds an upper limit based on the distribution of the Amount column.
2.1 Calculate Statistical Parameters
SQL Code:
CREATE VIEW amount_stats AS
SELECT
AVG(Amount) AS avg_amount,
STDDEV(Amount) AS stddev_amount,
PERCENTILE(Amount, 0.95) AS percentile_95
FROM
transactions;
Explanation:
- AVG(Amount): Calculates the average transaction amount.
- STDDEV(Amount): Calculates the standard deviation, useful for understanding data dispersion.
- PERCENTILE(Amount, 0.95): Computes the 95th percentile, used as the upper limit for normal transactions.
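The 95th percentile is one possible cutoff. A common alternative is a three-sigma limit (mean plus three standard deviations); a minimal sketch deriving both limits into one view (the view and column names here are illustrative, not part of the pipeline above):
CREATE VIEW amount_limits AS
SELECT
percentile_95 AS PercentileLimit,
avg_amount + 3 * stddev_amount AS ThreeSigmaLimit -- mean + 3 standard deviations
FROM amount_stats;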
Sample Output:

2.2 Identify Transactions Exceeding the Limit
SQL Code:
CREATE VIEW anomalies
AS
SELECT
TransactionID,
AccountID,
Amount,
CASE
WHEN Amount > (SELECT percentile_95 FROM amount_stats) THEN 'ExceedsLimit'
ELSE 'WithinLimit'
END AS AnomalyStatus
FROM
transactions;
Explanation:
- AnomalyStatus: A new column indicating whether the transaction exceeds the limit.
Sample Output:
TransactionID | AccountID | Amount | AnomalyStatus |
---|---|---|---|
TXN1127 | ACC4 | 95071.92 | ExceedsLimit |
... | ... | ... | ... |
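A quick sanity check on the view counts transactions per status:
SELECT
AnomalyStatus,
COUNT(*) AS TransactionCount
FROM anomalies
GROUP BY AnomalyStatus;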

Sample Visualization:

Step 3: Calculating Account Statistics
Compute total transactions, average, minimum, and maximum transaction amounts per account.
SQL Code:
CREATE TABLE account_totals
USING DELTA
AS
SELECT
AccountID,
COUNT(TransactionID) AS TransactionCount,
SUM(Amount) AS TotalAmount,
AVG(Amount) AS AverageAmount,
MIN(Amount) AS MinAmount,
MAX(Amount) AS MaxAmount
FROM
transactions
GROUP BY
AccountID;
Explanation:
- GROUP BY AccountID: Aggregates data per account.
- Aggregate Functions: Compute statistical measures for each account.
Sample Output:
AccountID | TransactionCount | TotalAmount | AverageAmount | MinAmount | MaxAmount |
---|---|---|---|---|---|
ACC4 | 2 | 95572.42 | 47786.21 | 500.50 | 95071.92 |
... | ... | ... | ... | ... | ... |
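As a usage example, ordering this table by its aggregates quickly surfaces accounts with outlying activity (the LIMIT is arbitrary):
SELECT
AccountID,
TransactionCount,
MaxAmount,
AverageAmount
FROM account_totals
ORDER BY MaxAmount DESC
LIMIT 5;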

Step 4: Identifying Accounts Exceeding Limits
Identify accounts that have transactions exceeding the predefined limit.
4.1 Determine Exceeding Accounts
SQL Code:
CREATE VIEW exceeding_accounts AS
SELECT DISTINCT AccountID
FROM anomalies
WHERE AnomalyStatus = 'ExceedsLimit';
4.2 Count Exceeding and Non-Exceeding Accounts
SQL Code:
SELECT
(SELECT COUNT(*) FROM exceeding_accounts) AS ExceedingAccounts,
(SELECT COUNT(DISTINCT AccountID) FROM transactions) - (SELECT COUNT(*) FROM exceeding_accounts) AS NonExceedingAccounts;
Sample Output:
ExceedingAccounts | NonExceedingAccounts |
---|---|
12 | 3 |

Step 5: Recognizing Unusual Temporal Patterns
Detect transactions occurring outside of normal business hours (e.g., 9 AM to 5 PM).
SQL Code:
CREATE VIEW after_hours_transactions
AS
SELECT
*,
HOUR(TransactionTimestamp) AS TransactionHour
FROM
transactions
WHERE
HOUR(TransactionTimestamp) NOT BETWEEN 9 AND 17;
Explanation:
- HOUR(TransactionTimestamp): Extracts the hour from the timestamp.
- NOT BETWEEN 9 AND 17: Filters transactions outside business hours.
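Step 6: Identifying Irregular Transaction Types
Detect transaction types that are unusual for a specific account, covering the irregular-transaction objective. The rule below treats a type an account has used only once as irregular; the view name and threshold are illustrative choices you can adapt.
SQL Code:
CREATE VIEW irregular_transaction_types AS
SELECT
t.TransactionID,
t.AccountID,
t.TransactionType,
t.Amount
FROM transactions t
JOIN (
-- How often each account uses each transaction type
SELECT AccountID, TransactionType, COUNT(*) AS TypeCount
FROM transactions
GROUP BY AccountID, TransactionType
) freq
ON t.AccountID = freq.AccountID
AND t.TransactionType = freq.TransactionType
WHERE freq.TypeCount = 1;
Explanation:
- TypeCount = 1: A transaction type used only once by an account is flagged as irregular; raise the threshold to make the rule less strict.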
Step 7: Analyzing Spending Behaviors by Merchant
Aggregate spending data to understand customer behaviors based on merchant interactions.
SQL Code:
CREATE TABLE merchant_spending
USING DELTA
AS
SELECT
Merchant,
COUNT(*) AS TransactionCount,
SUM(Amount) AS TotalSpent,
AVG(Amount) AS AverageTransactionAmount,
MIN(Amount) AS MinAmount,
MAX(Amount) AS MaxAmount
FROM
transactions
GROUP BY
Merchant;
Sample Output:
Merchant | TransactionCount | TotalSpent | AverageTransactionAmount | MinAmount | MaxAmount |
---|---|---|---|---|---|
MerchantH | 2 | 110679.81 | 55339.91 | 15607.89 | 95071.92 |
... | ... | ... | ... | ... | ... |

Sample Visualization:

Step 8: Leveraging Delta Lake Features
8.1 Enable Change Data Feed
Enable Delta Lake's change data feed to track changes in the transactions table.
SQL Code:
ALTER TABLE transactions
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
8.2 Perform Updates and Track Changes
Update transaction amounts due to corrections and track these changes.
SQL Code:
-- Update a transaction
UPDATE transactions
SET Amount = 1300.00
WHERE TransactionID = 'TXN1127';
-- Check the last version
DESCRIBE HISTORY transactions;

-- Query changes via the change data feed; the start version must be at or after the version where CDF was enabled
SELECT *
FROM table_changes('transactions', 3)
WHERE TransactionID = 'TXN1127';

8.3 Schema Evolution: Add New Column
Add a new column, FraudFlag, to flag suspected fraudulent transactions.
SQL Code:
ALTER TABLE transactions
ADD COLUMNS (FraudFlag BOOLEAN COMMENT 'Flag indicating suspected fraud');
8.4 Update Fraudulent Transactions
Flag transactions exceeding the upper limit.
SQL Code:
UPDATE transactions
SET FraudFlag = TRUE
WHERE Amount > 95071.92; -- example hardcoded threshold; in practice, use the limit derived in amount_stats

8.5 Time Travel and Table Restore
View Table History
SQL Code:
DESCRIBE HISTORY transactions;

Query a Previous Version
SQL Code:
SELECT * FROM transactions VERSION AS OF 3;
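Delta Lake also supports timestamp-based time travel; the timestamp below is illustrative:
SELECT * FROM transactions TIMESTAMP AS OF '2023-01-02 00:00:00';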
Restore Table to a Previous Version
SQL Code:
RESTORE TABLE transactions TO VERSION AS OF 3;
Explanation:
- Time Travel: Access previous versions of the data for auditing or recovery.
- Table Restore: Revert the table to a known good state after unintended changes.
Step 9: Generating Final Summary Reports
Generate reports summarizing accounts exceeding limits and compute averages.
SQL Code:
CREATE TABLE account_exceeding_summary
USING DELTA
AS
SELECT
a.AccountID,
a.TransactionCount,
a.TotalAmount,
a.AverageAmount,
a.MinAmount,
a.MaxAmount,
CASE
WHEN e.AccountID IS NOT NULL THEN 'ExceedsLimit'
ELSE 'WithinLimit'
END AS ExceedingStatus
FROM
account_totals a
LEFT JOIN
exceeding_accounts e ON a.AccountID = e.AccountID;

-- Compute averages for exceeding and non-exceeding accounts
SELECT
ExceedingStatus,
COUNT(AccountID) AS AccountCount,
AVG(TotalAmount) AS AverageTotalAmount,
AVG(AverageAmount) AS AverageTransactionAmount
FROM
account_exceeding_summary
GROUP BY
ExceedingStatus;

Delta Lake Features Utilized
- Delta Tables: All tables are created using Delta Lake to leverage ACID transactions and schema enforcement.
- Time Travel: Access previous versions of data for auditing and recovery.
- Table Restore: Restore tables to a previous state if necessary.
- Change Data Feed: Track changes to data for incremental processing and auditing.
- Schema Evolution: Add new columns (FraudFlag) without disrupting existing data.
- Constraints and Checks: Enforce data integrity through schema definitions.
- Partitioning: Although not explicitly used here, Delta Lake supports partitioning for performance optimization; see the sketch below.
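As an illustration only (this table is not used elsewhere in the pipeline), a partitioned copy of the transaction data could be created like this:
CREATE TABLE transactions_by_location
USING DELTA
PARTITIONED BY (Location)
COMMENT 'Illustrative partitioned copy of the transactions table'
AS SELECT * FROM transactions;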
Data Lineage and OpenLineage Integration
By executing these SQL operations in an environment integrated with OpenLineage, each operation generates lineage metadata. This metadata captures:
- Data Flow: How data moves from one table to another (e.g., from transactions to anomalies).
- Transformations: The specific transformations applied to the data (e.g., adding AnomalyStatus).
- Schema Changes: Modifications to the table schema (e.g., adding the FraudFlag column).
- Versioning: Changes in table versions due to updates and restores.
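Ilum ships with this integration preconfigured. Outside Ilum, the Spark integration is typically enabled through the OpenLineage Spark listener; a minimal spark-defaults sketch (the endpoint URL and namespace are placeholders, and exact property names depend on your OpenLineage version):
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type http
spark.openlineage.transport.url http://lineage-backend:5000
spark.openlineage.namespace financial-anomaly-pipeline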
Benefits:
- Transparency: Clear understanding of how data is processed and transformed.
- Auditing: Ability to trace data lineage for compliance and auditing purposes.
- Debugging: Easier identification of issues in data processing pipelines.
- Optimization: Insights into data flow can help optimize performance.

Conclusion
This use case demonstrates how to perform financial anomaly detection using Spark SQL and Delta Lake, incorporating key features like time travel, change data feed, and schema evolution. By integrating with OpenLineage, the solution provides comprehensive data lineage tracking, enhancing transparency and compliance.
Key Takeaways:
- Effective Anomaly Detection: Utilizing statistical methods to identify transactions and accounts that deviate from normal patterns.
- Robust Data Management: Leveraging Delta Lake features to ensure data integrity, facilitate schema changes, and enable recovery options.
- Data Lineage Visibility: Capturing detailed lineage information to support auditing and compliance requirements.
- Business Insights: Generating reports and analyses that provide valuable insights into customer behaviors and potential risks.
Appendix
Dataset Schema
transactions Table Schema:
Column | Data Type | Description |
---|---|---|
TransactionTimestamp | TIMESTAMP | Date and time when the transaction occurred |
TransactionID | STRING | Unique identifier for the transaction |
AccountID | STRING | Unique identifier for the account involved |
Amount | DECIMAL(18,2) | Monetary value of the transaction |
Merchant | STRING | Merchant involved in the transaction |
TransactionType | STRING | Type of the transaction (e.g., Purchase, Withdrawal) |
Location | STRING | Location where the transaction occurred |
FraudFlag | BOOLEAN | Flag indicating suspected fraud (added later) |