Excuse me, do you speak fraud?
Network graph analysis for fraud detection and mitigation
Network analysis offers a new set of techniques to tackle the persistent and growing problem of complex fraud. Network analysis supplements traditional techniques by providing a mechanism to bridge investigative and analytics methods. Beyond base visualization, network analysis provides a standardized platform for complex fraud pattern storage and retrieval, pattern discovery and detection, statistical analysis, and risk scoring. This article gives an overview of the main challenges and demonstrates a promising approach using a hands-on example.
Understanding the problem of fraud detection
With swelling globalization, advanced digital communication technology, and international financial deregulation, fraud investigators face a daunting battle against increasingly sophisticated fraudsters. Fraud is estimated to encompass 5% of the global economy, resulting in an annual loss of more than €2.3 trillion. Further, indications are that fraud is growing in volume, scope, and sophistication.
In an increasingly global and virtual world, the methods for perpetrating fraud are growing in sophistication. Fraudsters are also increasingly able to collaborate in international rings to perpetrate their schemes and to distribute ill-gotten gains.
A complication in the effort to detect and mitigate complex cases of fraud is the difficulty of smoothly bridging the worlds of forensic investigation and data analytics. Fraud investigators, with deep domain knowledge and street smarts, plough through complex documents, interview parties of interest, and spend time understanding the arcane schemes by which fraudsters attempt to avoid detection.
However, the growing scale of complex fraud means that investigators are increasingly being overwhelmed by volume. Additionally, fraud specialists have deep knowledge of complex fraud cases – tacit knowledge – yet it is difficult to make this knowledge explicit such that it is suitable for efficient sharing. This is especially the case in terms of difficulties in describing complex fraud in terms of patterns suitable for systems-driven detection and analysis.
Meanwhile, data analytics experts gather, transform, and analyse datasets for possible fraud, addressing the challenges of scale and volume via ‘big data’ approaches. They apply statistical techniques for detecting outliers and algorithmic techniques, such as machine learning, for identifying suspicious patterns.
However, fraud detection via advanced analytics typically depends on structured datasets and structured data models, and it is rare that exhaustive datasets are available which encompass all the domains surrounding complex cases of fraud.
As well, machine learning and data mining methods are primarily ‘supervised’, meaning they require training datasets which contain known fraud cases. Sophisticated fraudsters often are knowledgeable concerning automated detection methods and take pains to evade such detection. As a result, complex and ‘innovative’ types of fraud potentially circumvent automated detection when the methods avoid upsetting standard processes (i.e. they leave a seemingly ‘normal’ data trail).
- Want to learn more? Blog posting on the challenges of automated fraud detection http://tinyurl.com/kkrg8ye
Network analytics for fraud detection
Network analytics is a powerful tool to amplify traditional fraud investigative approaches: a method for cataloguing known, detecting hidden, and discovering new types of fraud. What is principally lacking between the forensics world and the world of data analytics is a transparent, standard language for communicating and searching for complex fraud patterns.
The fraud investigative world deals in rich details and confronts constantly emerging and evolving techniques. The computational world typically communicates in highly structured, abstract datasets and applies its analysis to them. Datasets are often limited in scope and relational database models are slow to accommodate rapidly evolving schemes. Somewhere along the line, the rich complexities of fraud schemes evade both hands-on and automated detection.
This is where network graph analysis is of central value: it offers a method for capturing the rich context of fraud in a standard, machine-readable, and transferable format. Once captured in such a format, deep pattern and statistical analysis can be conducted on existing datasets. Network analytics is thus a complementary approach which enhances and bridges fraud investigatory and data analytics approaches.
- Want to learn more? Brief overview and demonstration of network analytics http://tinyurl.com/p3pbllx
The schemes to dodge or exploit taxes are manifold and range from simple to labyrinthine. In particular, enterprises and institutions suffer when complex schemes are systematized at high volume or involve transactions in large amounts. Sophisticated fraudsters operating at this scale often operate in rings and across borders.
Case in point – EU VAT fraud
As an example, particular markets in the EU are susceptible to cross-border fraud schemes whereby participants seek to avoid value-added-tax (VAT) charges and exploit national tax credits. The amounts are substantial, with some EU countries forgoing or improperly crediting VAT charges of up to 25%. Avoidance and claiming improper credits together systemically cut into national tax revenues across the EU.
- Want to learn more? Recent article on analytics for tax fraud detection
Fraudsters are savvy in targeting particular markets and borders, often operating via complex sets of cross-border holding companies and ownership structures. Emerging, unregulated, and highly dynamic markets are particularly at risk, such as those associated with emerging or high-volume specialized commodities. Markets which deal in tradable rights or other intangibles are also at risk, as they do not leave a physical trail (no witnesses, shipping records, or storage manifests).
Via native network data analysis, such complex fraud schemes can be described in both their general and specific manifestations. As an example, a recognized VAT fraud involves trading international telecommunication rights (the exchange of rights to telecommunication service). The pattern of a particular scheme was translated into a network format and stored in a ‘graph database’ (a native database for storing, managing, and retrieving networked data):
Figure 1: Cross-border EU value-added-tax fraud scheme involving a missing trader and tax credit abuse as encoded in a standard network format with countries denoted (names fictionalized)
The scheme can be summarized as follows:
- Southern Europa Telco (3-) buys U.S. phone card rights from two U.S. companies (1- and 2-),
- Southern Europa Telco (3-) re-sells within Italy to Joint Bridge Co. (4-) and collects VAT,
- Southern Europa Telco (3-) does not pay VAT to Italian tax authorities, instead disappearing with the VAT and becoming a missing trader,
- Joint Bridge Co. (4-) resells to Swift Co. (6-) within Italy via parent company Joint IT Group (A-),
- Swift Co. (6-) pays VAT to Joint IT Group (A-),
- Swift Co. (6-) sells across border to UK Chips Trading Ltd. (7-) and U.S. Nexus Global US Ltd. (11-),
- Swift Co. claims VAT credit from Italian tax authorities to offset other international business activities,
- Chips Trading Ltd. (7-) sells to Strand VI Co. (9-) in the Virgin Islands via sister firm Chips Global (8-) within Chips UK Group (B-) – this allows Chips UK Group (B-) to claim VAT neutrality,
- Strand VI Co. in the Virgin Islands becomes the final recipient of the phone card rights, which can then be recycled to the U.S. Presumably a back-door mechanism exists within the Virgin Islands for participants to share in the benefits: VAT appropriated by missing trader and Italian VAT tax credits.
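The steps above can be sketched as a small directed graph in plain Python. This is only an illustration of the network encoding, not the graph database format itself. The names for companies 1 and 2 are placeholders (the scheme does not name them), and the countries assigned to Joint IT Group and Chips Global are assumptions based on their group membership.

```python
from collections import deque

# Companies keyed by their labels in the scheme: (name, country code).
# Names for "1" and "2" are placeholders; countries for "A" and "8" are assumed.
COMPANIES = {
    "1": ("U.S. supplier 1 (placeholder)", "US"),
    "2": ("U.S. supplier 2 (placeholder)", "US"),
    "3": ("Southern Europa Telco", "IT"),
    "4": ("Joint Bridge Co.", "IT"),
    "A": ("Joint IT Group", "IT"),
    "6": ("Swift Co.", "IT"),
    "7": ("Chips Trading Ltd.", "UK"),
    "8": ("Chips Global", "UK"),
    "9": ("Strand VI Co.", "VI"),
    "11": ("Nexus Global US Ltd.", "US"),
}

# Directed flow of the phone card rights, following the listed steps.
FLOWS = [
    ("1", "3"), ("2", "3"), ("3", "4"), ("4", "A"), ("A", "6"),
    ("6", "7"), ("6", "11"), ("7", "8"), ("8", "9"),
]


def trace(start, goal, flows):
    """Breadth-first search for the shortest chain of sales from start to goal."""
    successors = {}
    for seller, buyer in flows:
        successors.setdefault(seller, []).append(buyer)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in successors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def border_crossings(path, companies):
    """Count the steps where seller and buyer are in different countries."""
    return sum(companies[a][1] != companies[b][1] for a, b in zip(path, path[1:]))


chain = trace("1", "9", FLOWS)
print(" -> ".join(COMPANIES[c][0] for c in chain))
print("border crossings:", border_crossings(chain, COMPANIES))
```

Even in this reduced form, the encoding makes the route from a U.S. supplier to the final recipient explicit, along with the points at which the rights cross a border.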
Recognized schemes, often the result of an intensive fraud investigation, can thus be encoded using a standard format. The pattern can then be used to detect similar transactions in large datasets. However, the Italian national tax authority, absent full details from foreign tax authorities, likely only has insight into a reduced transactional view of this scheme. Namely, only initial transactions across the border and within Italy are likely visible:
Figure 2: Cross-border EU VAT tax fraud scheme from the perspective of Italian tax authorities
In this manifestation, it becomes difficult for the Italian tax authorities to apply traditional automated detection methods (e.g. data mining or machine learning). However, because the full VAT fraud scheme has been documented in a network format, characteristic details of the fraud are documented with it. In particular, several unusual aspects of the Italian companies are present in the full fraud pattern documented previously and can be stored alongside it:
- Transience of the missing trader: the chief hallmark of this fraud pattern is evidence that the missing trader is a ‘front’, a company set up quickly with the intention of disappearing quickly. Data from the Chamber of Commerce and tax office may show that the inception date of the company is close to the initial purchase transaction, triggering an alert. Upon a warning, forensic investigators can examine additional details to substantiate the company as being ‘at risk’, for instance a false or non-answering phone number, an unoccupied address, and/or a ‘fake’ website.
- Velocity: for the fraud to operate at a low risk of detection, the entire transaction is likely completed in a relatively compressed period of time (ideally before the missing trader is detected by the tax office). The short time-span can be calculated from the date signatures on the transactions in the data and detected.
- Position of the missing trader: the missing trader is the initial purchaser at the border. A rapid transaction chain (see velocity) exiting the country in three steps could be used to trigger an alert to immediately check the validity of the initial purchaser (see transience).
- Volume and/or scale: for the fraud to be commercially viable, it needs to be conducted either at great volume or scale. Multiple transaction chains along the same path in a short time period and/or large transactions are potential alerts to check the transience of the initial purchaser.
- Additional data: company ownership by citizens (national citizen number) can be layered onto network data. Citizens with ownership stakes in two or more companies in the transaction chain would be considered suspicious, for instance.
- Third-party data: data from the police, banks, and credit agencies can be layered onto the network data to identify individuals and companies with a high risk for fraud, and the resulting scores can be used in aggregate to rate a transaction chain as high risk.
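Several of these indicators reduce to simple calculations once a transaction chain has been extracted. The following is a minimal sketch in Python, assuming a chain is available as an ordered list of `(seller, buyer, epoch, amount)` tuples. The function name, the thresholds (90 days, 15 days, 1,000,000), and the sample data are all hypothetical.

```python
DAY = 60 * 60 * 24  # seconds per day


def chain_flags(sales, inception, owners,
                max_age_days=90, max_span_days=15, min_amount=1_000_000):
    """Return indicator flags for one ordered chain of sales.

    sales:     list of (seller, buyer, epoch, amount), in transaction order
    inception: company -> founding date as a Unix timestamp
    owners:    company -> list of citizen ids with an ownership stake
    All thresholds are illustrative assumptions.
    """
    first, last = sales[0], sales[-1]
    importer = first[1]  # initial purchaser at the border
    age_days = (first[2] - inception[importer]) / DAY
    span_days = (last[2] - first[2]) / DAY

    # Count, per citizen, the companies in this chain in which they hold a stake.
    companies = {s[0] for s in sales} | {s[1] for s in sales}
    stakes = {}
    for company in companies:
        for person in owners.get(company, ()):
            stakes[person] = stakes.get(person, 0) + 1

    return {
        "transient_importer": 0 <= age_days < max_age_days,
        "high_velocity": span_days < max_span_days,
        "large_scale": max(s[3] for s in sales) >= min_amount,
        "shared_owners": sorted(p for p, n in stakes.items() if n >= 2),
    }


# Hypothetical chain: importer founded 20 days before its first purchase,
# goods leave the country again within 9 days.
T0 = 1_700_000_000
sales = [
    ("Supplier", "Importer", T0, 25_000_000),
    ("Importer", "Bridge", T0 + 2 * DAY, 25_000_000),
    ("Bridge", "Exporter", T0 + 5 * DAY, 25_000_000),
    ("Exporter", "Foreign", T0 + 9 * DAY, 25_000_000),
]
inception = {"Importer": T0 - 20 * DAY}
owners = {"Bridge": ["P-100"], "Exporter": ["P-100", "P-200"]}

flags = chain_flags(sales, inception, owners)
print(flags)
```

In practice the flags would more likely feed a score than act as hard rules, since each indicator on its own can also occur in legitimate trade.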
Working with the Neo4j graph database, we can encode such a fraud scheme pattern via a Cypher statement. This pattern represents an approximation of the limited set of transactions visible to the Italian authorities:
CREATE (CO1)-[:SELLS_TO{date: '41548', item_type: 'phone cards rights', epoch: 1380617873, amt: '10000000'}]->(CO3)
CREATE (CO2)-[:SELLS_TO{date: '41548', item_type: 'phone cards rights', epoch: 1380617873, amt: '15000000'}]->(CO3)
CREATE (CO3)-[:SELLS_TO{date: '41557', item_type: 'phone cards rights', epoch: 1381395473, amt: '25000000'}]->(CO4)
CREATE (CO12)-[:SELLS_TO{date: '41562', item_type: 'phone cards rights', epoch: 1381827473, amt: '25000000'}]->(CO6)
CREATE (CO6)-[:SELLS_TO{date: '41567', item_type: 'phone cards rights', epoch: 1382259473, amt: '25000000'}]->(CO7)
CREATE (CO6)-[:SELLS_TO{date: '41572', item_type: 'phone cards rights', epoch: 1382691473, amt: '25000000'}]->(CO11)
CREATE (CO8)-[:SELLS_TO{date: '41577', item_type: 'phone cards rights', epoch: 1383123473, amt: '25000000'}]->(CO9)
CREATE (CO3)-[:COLLECTS_VAT{date: '41557', item_type: 'VAT paid', epoch: 1381395473, amt: '10000000'}]->(CO4)
CREATE (CO12)-[:COLLECTS_VAT{date: '41562', item_type: 'VAT paid', epoch: 1381827473, amt: '10000000'}]->(CO6)
CREATE (CO12)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO4)
CREATE (CO12)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO5)
CREATE (CO13)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO7)
CREATE (CO13)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO8)
CREATE (CO14)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO10)
CREATE (CO14)-[:PARENT_OF{date: '', item_type: 'parent company'}]->(CO11)
CREATE (P01)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO1)
CREATE (P02)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO2)
CREATE (P03)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO3)
CREATE (P04)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO4)
CREATE (P05)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO5)
CREATE (P06)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO6)
CREATE (P07)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO7)
CREATE (P08)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO8)
CREATE (P09)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO9)
CREATE (P10)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO10)
CREATE (P02)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO11)
CREATE (P04)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO12)
CREATE (P11)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO13)
CREATE (P12)-[:DIRECTOR_OF{date: '', item_type: 'director'}]->(CO14)
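One detail already encoded in these statements is that some individuals direct more than one company. As a minimal sketch outside the graph database, the `DIRECTOR_OF` pairs can be copied into plain Python and grouped by person. The helper name `multi_company_directors` is hypothetical.

```python
from collections import defaultdict

# (person, company) pairs copied from the DIRECTOR_OF statements above.
DIRECTOR_OF = [
    ("P01", "CO1"), ("P02", "CO2"), ("P03", "CO3"), ("P04", "CO4"),
    ("P05", "CO5"), ("P06", "CO6"), ("P07", "CO7"), ("P08", "CO8"),
    ("P09", "CO9"), ("P10", "CO10"), ("P02", "CO11"), ("P04", "CO12"),
    ("P11", "CO13"), ("P12", "CO14"),
]


def multi_company_directors(pairs):
    """Return each person who directs two or more companies, with those companies."""
    boards = defaultdict(set)
    for person, company in pairs:
        boards[person].add(company)
    return {
        person: sorted(companies, key=lambda c: int(c[2:]))  # order CO2 before CO11
        for person, companies in boards.items()
        if len(companies) >= 2
    }


shared = multi_company_directors(DIRECTOR_OF)
print(shared)
```

In this data, P02 and P04 each appear on two boards, which is the kind of overlap the ownership indicator is meant to surface.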
Given this pattern and knowing the tell-tale aspects of the fraud, a query can be developed which will identify a similar pattern in a large set of transactional data. In this example, we would like to identify any sets of cross-border telecommunication rights trades occurring over a short period of time (e.g. less than 15 days) in which an intermediary company in the chain of transactions is quite new (e.g. less than 90 days old).
Working with Cypher, we can query a large Neo4j dataset for this specific pattern in tax transactions (thanks to Jean Villedieu of linkurio.us for the query design):
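// Find chains of sales that start and end in different countries
MATCH p=(a:Company)-[rs:SELLS_TO*]->(c:Company)
WHERE a.country <> c.country
WITH p, a, c, rs, nodes(p) AS ns
// Keep companies founded less than 90 days before the reference date (epoch seconds)
WITH p, a, c, rs, [n IN ns WHERE 1383123473 - n.epoch < (90*60*60*24)] AS bs
WITH p, a, c, rs, head(bs) AS b
WHERE b IS NOT NULL
WITH p, a, b, c, head(rs) AS r1, last(rs) AS rn
// Elapsed time between the first and last sale in the chain
WITH p, a, b, c, r1, rn, rn.epoch - r1.epoch AS d
WHERE d < (15*60*60*24)
RETURN a, b, c, d, r1, rn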
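For readers without a Neo4j instance at hand, the intent of the query (cross-border chains shorter than 15 days that involve a company less than 90 days old) can be sketched in plain Python. The `SELLS_TO` edges and their epochs are taken from the CREATE statements earlier. The country codes and the inception date for CO3 are assumptions added for illustration, since the CREATE statements do not set node properties.

```python
DAY = 60 * 60 * 24

# SELLS_TO edges (seller, buyer, epoch) from the CREATE statements.
SELLS_TO = [
    ("CO1", "CO3", 1380617873), ("CO2", "CO3", 1380617873),
    ("CO3", "CO4", 1381395473), ("CO12", "CO6", 1381827473),
    ("CO6", "CO7", 1382259473), ("CO6", "CO11", 1382691473),
    ("CO8", "CO9", 1383123473),
]
# Assumed node properties (not present in the original statements).
COUNTRY = {"CO1": "US", "CO2": "US", "CO3": "IT", "CO4": "IT", "CO6": "IT",
           "CO7": "UK", "CO8": "UK", "CO9": "VI", "CO11": "US", "CO12": "IT"}
INCEPTION = {"CO3": 1380617873 - 20 * DAY}  # hypothetical: founded 20 days before first purchase


def paths(edges):
    """Yield every directed chain of one or more SELLS_TO edges (no edge reused)."""
    outgoing = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)

    def extend(chain):
        yield chain
        for edge in outgoing.get(chain[-1][1], []):
            if edge not in chain:
                yield from extend(chain + [edge])

    for edge in edges:
        yield from extend([edge])


def suspicious(edges, country, inception, reference,
               max_age=90 * DAY, max_span=15 * DAY):
    """Return (first seller, young company, last buyer, span in days) per matching chain."""
    hits = []
    for chain in paths(edges):
        a, c = chain[0][0], chain[-1][1]
        if country[a] == country[c]:
            continue  # not a cross-border chain
        nodes = [a] + [edge[1] for edge in chain]
        young = [n for n in nodes
                 if n in inception and 0 <= reference - inception[n] < max_age]
        span = chain[-1][2] - chain[0][2]
        if young and span < max_span:
            hits.append((a, young[0], c, span // DAY))
    return hits


hits = suspicious(SELLS_TO, COUNTRY, INCEPTION, reference=1383123473)
for hit in hits:
    print(hit)
```

A graph database performs the path enumeration natively and at scale; the sketch only makes the filtering logic explicit.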
To summarize, having identified the full fraud pattern, we abstracted a version limited to data available to the Italian tax authorities. These details were used to design a specific query which can then identify the fraud pattern in a large set of tax transaction data.
The full fraud pattern, stored in a graph database ‘fraud library’ in an annotated, network-descriptive format, gives tell-tale indications for detection in the smaller pattern-set available to the national tax authorities. This then supports detection in a large set of national tax data.
Beyond visualization: statistical measures
The value of storing fraud schemes as standard patterns in a network format (in a graph database) can be summarized as:
- standardization without sacrificing detail,
- ability to communicate patterns between systems transparently,
- ability to amplify patterns with additional data, and
- ability to run dynamic network queries on ‘big data’ sets.
However, an additional benefit exists – the ability to characterize statistical measures to empower the discovery of new patterns and automatic pattern detection.
Network science and graph analysis encompass rich, existing fields of study which specify and study recurring patterns and quantitative aspects of networks. Likewise, the social sciences have adopted these principles to study social phenomena via social network analysis (SNA).
- Want to learn more? What measures does social network analysis provide?
Together, these domains observe that all network structures have common patterns, and that these patterns can be studied and quantified. Networks can be measured in terms of hard measures such as reach, clustering or modularity, centrality, and dispersion. Transactions entail steps across a network, and these steps can be scored in terms of ‘weight’, for instance in terms of volume, frequency, speed (over time), amount (monetary), or risk (e.g. credit risk). Additionally, individuals and companies can be assessed in terms of their relative positions and interactions in a network.
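A few of these measures can be illustrated on a toy network. The sketch below, in plain Python, computes degree centrality, downstream reach, and a monetary edge weight aggregated per seller. The five companies and their trade amounts are hypothetical, and a dedicated graph library or graph database would normally be used for larger networks.

```python
from collections import defaultdict, deque

# Hypothetical trades: (seller, buyer, amount).
TRADES = [("A", "B", 10), ("A", "C", 5), ("B", "C", 20), ("C", "D", 25), ("D", "E", 25)]


def degree_centrality(trades):
    """Fraction of the other nodes each node trades with directly (direction ignored)."""
    neighbours = defaultdict(set)
    for seller, buyer, _ in trades:
        neighbours[seller].add(buyer)
        neighbours[buyer].add(seller)
    n = len(neighbours)
    return {node: len(ns) / (n - 1) for node, ns in neighbours.items()}


def reach(trades, start):
    """Set of nodes reachable from start by following sales downstream."""
    downstream = defaultdict(set)
    for seller, buyer, _ in trades:
        downstream[seller].add(buyer)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in downstream.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def sold_volume(trades):
    """Total monetary amount sold per company, one possible 'weight' measure."""
    volume = defaultdict(int)
    for seller, _, amount in trades:
        volume[seller] += amount
    return dict(volume)


centrality = degree_centrality(TRADES)
print(centrality)
print(reach(TRADES, "A"))
print(sold_volume(TRADES))
```

Here company C has the highest degree centrality because every chain passes through it, which is the kind of positional information that a purely tabular view of the transactions does not expose.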
- Want to learn more? Network analytics: more than pretty pictures
Returning to the VAT fraud example, national tax offices have data concerning company cross-ownership and the association of citizens (via national identification numbers). These details can be used to assess the association of known fraudsters or high-risk individuals with others. Thus, a seemingly ‘clean’ company or individual which transacts frequently or in high amounts between two high-risk entities could be flagged as participating in at-risk transactions. The results can then be used to enhance traditional machine learning detection methods.
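The ‘clean company between two high-risk entities’ idea can be sketched as a simple derived feature. The example below is a hedged illustration in Python: the base risk scores, the 0.7 threshold, the company names, and the function names are all hypothetical, and the exposure ratio is just one of many possible ways to turn network position into a number.

```python
def bridge_features(trades, base_risk, threshold=0.7):
    """For each low-risk company, report whether it sits between high-risk
    counterparties and what share of its trade volume involves them.

    trades:    list of (seller, buyer, amount)
    base_risk: entity -> risk score in [0, 1] from external data (assumed)
    """
    def risky(entity):
        return base_risk.get(entity, 0.0) >= threshold

    suppliers, customers, total, exposed = {}, {}, {}, {}
    for seller, buyer, amount in trades:
        customers.setdefault(seller, set()).add(buyer)
        suppliers.setdefault(buyer, set()).add(seller)
        # Attribute the trade volume to both parties, tracking the risky share.
        for party, counterparty in ((seller, buyer), (buyer, seller)):
            total[party] = total.get(party, 0) + amount
            if risky(counterparty):
                exposed[party] = exposed.get(party, 0) + amount

    features = {}
    for company in total:
        if risky(company):
            continue  # already known to be high risk
        features[company] = {
            "bridges_high_risk": any(risky(s) for s in suppliers.get(company, ()))
                                 and any(risky(c) for c in customers.get(company, ())),
            "exposure": exposed.get(company, 0) / total[company],
        }
    return features


# Hypothetical data: M looks clean but trades mostly between risky X and Y.
trades = [("X", "M", 100), ("M", "Y", 90), ("M", "Z", 10), ("Q", "Z", 50)]
base_risk = {"X": 0.9, "Y": 0.8, "M": 0.1, "Z": 0.2, "Q": 0.1}

features = bridge_features(trades, base_risk)
print(features["M"])
```

Features of this kind could be appended as extra columns to the dataset used by a conventional machine learning model.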
Figure 3: Utilizing networked data to establish risk scores for transactions and company associations, which can be used to enhance machine learning approaches
Summary conclusion
The native storage of fraud patterns as network phenomena, and the application of these patterns to fraud detection, is a powerful technique. This approach allows for the composition of ‘fraud libraries’ to capture rich details concerning schemes. Once encoded, tell-tale features of the fraud can be identified to give investigators indications of where to focus automated detection efforts. Additionally, storing and analysing network data leads to new types of indicators via network analysis: statistical measures and the ability to ‘score’ transactions and associations for aggregate risk. At the cutting edge, data on networks can be examined and simulated over time to gain new insights into how markets and transactions are evolving in character, a foundation for strategy formation and proactive preparedness.
