libMEV - Data Methodologies

Published: November 19, 2023 data

Searchers, like artists, craft code with the strategic grace of a chessmaster, shaping sequences of logic in the digital realm, eternally etched in the blockchain.

Unfortunately deciphering these logic puzzles (bundles) in a generalizable way isn't a easy task. This is because we are essentially reverse engineering a complex puzzle, akin to identifying cake ingredients after being presented with a baked cake.

In this blogpost, we will be going through some of the data methodologies that we use in libMEV so everyone can be a MEV detective.

Prelude

libMEV is primarily designed to provide accurate MEV data, although it may not capture all the data, we are confident the data it does capture is accurate and reliable. Should you find any discrepancies, please alert us on our telegram.

Acknowlegements

We would like to thank the paradigm team. Cryo and reth is heavily used in our pipeline, and libMEV would not have been possible without those tools.

Data Collection

Before we can process the data, we first need to harvest the data. Using cryo, we extract the data in chunks of 1000 blocks. The type of data we're collecting are:

block
txs
events
balance_diffs
storage_diffs
eth_call (used for custom stuff, e.g. on-chain pricing)

We then store the data as-is into a blob storage, to be used for processing later on.

For the node, we are also using a vanilla version of reth. This is important later on as reth is one of the few EL nodes that has a specific feature we need in the post processing pipeline.

Data Processing

For the processing pipeline, we're using a custom implementation built from the ground up to work seamlessly with Apache Arrow (default output from cryo), and NOT JSON dumps from RPC calls.

This allows us to process the data much quicker, and at scale as we avoid the need of re-querying the raw data from the EL node.

Bundle? Not Bundle?

Just so everyone is on the same page: A bundle is a list of transactions that are in a particular order from the searcher.

source: flashbots

Searchers want the bundles to be executed in the order they are sent, as they are only able to extract value if done so.

For a bundle to be identified in our system, it needs to be in one of the following bundle types, each with their own unique set of properties:

Independent bundles
Backrunning bundles
Frontrunning and backrunning bundles

Independent Bundles

Independent bundles are simple, they're self-contained transactions and come mostly in the form of top-of-block arbitrages or liquidations, where there was leftover opportunity from the previous block, and do not rely on any prior transactions within the same block.

For liquidations, its very straightforward, as long as a liquidation event is emitted from a known borrowing protocol address (e.g. Aave, Compound), we can label it as so.

The challenge lies in distinguishing between an arbitrage and a regular token swap. Can you tell which is which below?

The transaction on the right stands out as an arbitrage. It doesn't ultilize the OneInch aggregator, whereas the left one does. Setting aside these explicit labels, how can we programatically determine if a transaction is an arbitrage or a regular token swap?

Lets revisit the definition of an arbitrage:

Note the key sentence: "...simulationeously buy and sell an asset..."

So, as long as the following properties hold true, we can confidently label it as an arbitrage:

to_address is NOT a known DEX exchange address
to_address / from_address sends and receives the same token

We can encapsulate the logic in the following pseudocode:

def is_token_cyclic(tx):
    """
    Is the token transfer in this transaction cyclic?
    If so its probably an arbitrage
    """
    # For every token transfer
    for tkn_xfer in tx.events.token_transfers:
        # Find the token being transferred
        token = tkn_xfer.address
        # The sender of said token
        sender = tkn_xfer.sender

        # Only care about the MEV bot's contract address
        if sender != tx.from_address or
           sender != tx.to_address:
           continue

        # Check against all other token transfers
        for other_tkn_xfer in tx.events.token_transfers:
            # If its not the same token
            # Ignore and continue
            if other_tkn_xfer.address != token:
                continue

            # Token is cyclic!
            # Sender buys and sells same asset
            if other_tkn_xfer.recipient == sender:
                return True

    return False


def is_arbitrage(tx):
    """
    Is the transaction an arbitrage?
    """
    # If its sending to OneInch / Uniswap router its
    # unlikely to be an arbitrage
    if tx.to_address in KNOWN_DEX_ADDRESS:
        return False
    return is_token_cyclic(tx)

Backrunning Bundles

Backrunning bundles are bundles where the searcher transaction is only inserted at the end of the bundle. This is because the opportunity only arises once the prior transactions have been executed.

# Backrunning
--start----------end-->
Tx0 -> Tx1 -> BackrunTx

To determine if the prior transactions hold any relevance to the final searcher transaction, we employ checks on the following properties:

Slots
Events

For the slots, if the preceeding transactions touches the same liquidity pair/pool slots, we have a high confidence that they are related. The logic roughly flows like so:

def is_txs_related_by_slots(tx1, tx2):
    """
    Are the transactions related? (Slot)
    """
    # Get relevant slots that were changed in the transaction
    tx1_lp_slots = filter(lambda x: x.address is in LP_POOLS, tx1.slots)
    tx2_lp_slots = filter(lambda x: x.address is in LP_POOLS, tx2.slots)

    # Find the intersection of said slots
    lp_slots_intersection = set(tx1_lp_slots).intersection(tx2_lp_slot)

    # If there's an interaction of LP slots, its probably related
    return len(lp_slots_intersection) > 0

Whereas for the events, its a bit more tricky, as you're looking for opposite swap events between txs (that needs to be accounted on a protocol by protocol basis).

def is_txs_related_by_events(tx1, tx2):
    """
    Are the transactions related? (Events)
    """
    # Swap events
    for swap_event in tx1.swap_events:
        for reverse_swap_event in tx2.swap_events:
            # Only care if its the same pool/pair
            if swap_event.address !=
                reverse_swap_event.address:
                continue

            # Tokens go opposite way
            if swap_event.in_token == reverse_swap_event.out_token or
                swap_event.out_token == reverse_swap_event.in_token:
                return True

    return False

Now if we combine the two, we will have a high degree of confidence that our identified backrunning bundles will be correct.

def is_backrunning(txs):
    """
    Given a list of txs, find out if they're related
    """
    # Searcher transaction is the final tx
    searcher_tx = txs[-1]
    non_searcher_txs = txs[:-1]

    # Keep iterating until a tx isn't relevant
    for other_tx in non_searcher_txs.reverse():
        # Exit loop if txs are not related
        if not is_txs_related_by_slots(searcher_tx, other_tx) or
             not is_txs_related_by_events(searcher_tx, other_tx):
            return False

    return True

Frontrunning and Backrunning Bundles

This is the most complicated type of bundles because of all the possible permutations. The naive way to classify these types of bundles is by assuming that the transactions need to occur at the start and end of the bundle.

# Naive assumption
--start---------------end-->
Frontrun -> Tx0 -> BackrunTx

This is true for simple cases, but does not generalize well as advanced searchers will perform a backrun and a frontrun in the same transaction to extract the maximum value.

# Advanced searcher 1
--start---------------------------------------end-->
Tx0 -> BackrunTx0 + FrontrunTx1 -> Tx1 -> BackrunTx1

You can also get more complicated bundles, i.e.

# Advanced searcher 2
--start----------------------------------------------------------------end-->
FrontrunTx -> Tx0 -> BackrunTx0 + FrontrunTx1 -> Tx1 -> Tx2 -> BackrunTx1+Tx2

Note that the possible permutations these orderings can take is factorial, so its in our best interest if our algorithm generalizes.

Interestingly we can achieve this simply by reusing most of the logic previously defined. Composition over inheritance!

For a simple frontrun and backrun, we know they're related when:

The frontrun tx and backrun tx closes a swap loop (i.e. is there an opposite swap event)

--start---------------end-->
Frontrun -> Tx0 -> BackrunTx

However we also know that the frontrun might contain a backrun within it. So, we can also assume that:

Any prior transactions before the first searcher tx that passes is_backrunning is part of the bundle
The final searcher transaction needs to close a swap loop with one of the txs between the frontrun and backrun

--start---------------------------------------end-->
Tx0 -> BackrunTx0 + FrontrunTx1 -> Tx1 -> BackrunTx1

# Note that in a case like so, we would classify them as separate bundles
--start------------------------------------------------------------end-->
Tx0 -> BackrunTx0 + FrontrunTx1 -> Tx1 -> BackrunTx1 -> Tx2 -> BackrunTx2

# i.e.
[Tx0 -> BackrunTx0 + FrontrunTx1 -> Tx1 -> BackrunTx1] -> [Tx2 -> BackrunTx2]

Lets try and encapsulate that into pseudocode:

def is_back_and_front_run(txs):
    # Extract searcher transactions from the list of txs
    searcher_txs = get_searcher_txs(txs)

    # Final tx needs to be searcher tx
    final_searcher_tx_idx = searcher_txs[-1]
    if final_searcher_tx_idx != len(txs):
        return False

    # Make sure searcher txs contain a close loop
    is_swap_closed = False
    for searcher_tx1_idx in range(0, len(searcher_txs) - 1):
        for searcher_tx2_idx in range(searcher_tx1_idx, len(searcher_txs)):
            front_tx = searcher_txs[searcher_tx1_idx]
            back_tx = searcher_txs[searcher_tx2_idx]

            if is_txs_related_by_slots(front_tx, back_tx) and
                is_txs_related_by_events(front_tx, back_tx):
                is_swap_closed = True
            if is_swap_closed:
                break
        if is_swap_closed:
            break

    # Loop doesn't even close
    if not is_swap_closed:
        return False

    # If the first tx isnt the searcher tx, make sure
    # its able to backrun
    first_searcher_tx_idx = txs.indexOf(searcher_txs[0])
    if first_searcher_tx_idx != 0:
        tx0 = txs[first_searcher_tx_idx]
        for i in range(0, first_searcher_tx_idx):
            tx1 = txs[i]

            # Its not a backrun, ser
            if not is_txs_related_by_slots(tx0, tx1) or
                not is_txs_related_by_events(tx0, tx1):
                return False

    # Make sure searcher txs closes a loop with a tx within
    final_searcher_tx = searcher_txs[-1]
    for i in range(first_searcher_tx_idx, final_searcher_tx_idx):
        tx_between = txs[i]

        # Its not a backrun, ser
        if not is_txs_related_by_slots(final_searcher_tx, tx_between) or
            not is_txs_related_by_events(final_searcher_tx, tx_between):
            return False

    # Pass all the checks!
    return True

Bundle Processing

Now that we have a way to identify the bundles, the next step is to process the bundle. Information that is of interest:

ETH tipped
ETH burned
Profit (in ETH and USDC)

Calculating ETH tipped/burned is simple: we look at the raw balance delta (obtained from balance_diffs) of the coinbase address (builder or validaor of the block).

The hard part is calculating the bundle profits given the various inventory changes:

What is the token balance delta before and after the bundle?
What is ETH/[TOKENS in bundle] exchange rate for the block?

Token Balance Delta

To calculate the token balance delta of the bundle, we rely on Transfer events (and on rare cases, Deposit/Withdraw events from WETH/sDAI).

Exchange Rate

Calculating the ETH/[TOKENS in bundle] exchange rate is a bit challenging for us due to how we architectured the data processing pipeline.

Remember, we are only relying on the extracted data from cryo. This means no querying the blockchain when processing historical chunks (this allows us to processing things much faster).

Instead we rely on on-chain swap events to determine the exchange rate of the token relative to ETH, which is painstakingly done on a protocol-by-protocol basis. e.g.

# Univ2 Sync
event Sync(uint112 reserve0, uint112 reserve1);

# Univ3 swap
event Swap(
    address sender,
    address recipient,
    int256 amount0,
    int256 amount1,
    uint160 sqrtPriceX96,
    uint128 liquidity,
    int24 tick
)

# Bancor
event TokensTraded(
    bytes32 contextId,
    Token indexed pool,
    Token indexed sourceToken,
    Token indexed targetToken,
    uint256 sourceAmount,
    uint256 targetAmount,
    uint256 bntAmount,
    uint256 targetFeeAmount,
    uint256 bntFeeAmount,
    address trader
);

# ... and more

You might ask, what happens if the tokens traded isn't against ETH? Good question. We have also extracted the price of popular base tokens (e.g. WBTC, USDC, USDT, DAI, rETH, stETH...) against ETH so we can perform conversions on the fly on a block-by-block basis. This has done well for the majority of cases.

With the inclusion the exchange rate of popular base tokens we are able to calculate the exchange rate of arbitrary tokens against ETH most of the time!

Bundle Post-Processing

Unfortunately, events are not the source of truth. Its the actual blockchain state that tells the truth. As such, we also have a post-processing pipeline:

We query the balance balance delta of tokens via eth_callMany on reth before and after the transaction index in the block
- AFAIK only reth and erigon it allows us to specify the tx index within the block to execute them on (as opposed to just before / after block)
We calculate the exchange rate of said tokens by checking them on popular exchanges (e.g. Uniswap / Sushiswap/ Balancer / Pancakeswap / Curve...)
- The exchange rate will also be calculated against block number - 1
  - If the exchange rate has changed significantly (i.e. x100 in one block), we will assume the token is worthless
  - Rare cases where the price of a valid asset x10 in a block (i.e. CRV on block 17,807,830 -> 17,807,831)
- If the liquidity is less than some threshold, we will also assume the token is worthless
- If the token passes the above rule, we will take the median exchange rate from all popular exchanges

Once the historical chunks have been processed, the post-processing pipeline will be executed against the top and bottom N-number of transactions (sorted by profits) to ensure the outliers are removed.

Anormalies - $PAW

If we were to follow this transaction

The exchange rate for $PAW would have been astronomical at the beginning (~100ETH = 1PAW).

In this bundle, searcher gained ~0.09 $ETH, but lost ~18.5 $PAW.

Without the post-processing pipeline, it would have been calculated that the searcher lost an equilavent of ~1800 ETH in this bundle, which is simply not true.

TLDR

FAQ

Why isn't [insert MEV address] here?

We only index whitelisted searchers, for now. If you'd like us to index it, post it in the telegram group

How are the exchange rate of tokens determined?

They're determined at the start of the bundle. For rare cases where there is low liquidity and the exchange rate is easily manipulated, we consider those tokens to be worthless. During post-processing, it'll be the median exchange rate against popular DEXes.