Proof of concept for querying spatial data on IPFS using geohash#

InterPlanetary File System (IPFS) offers an alternative approach to storing and distributing spatial data. Because spatial datasets grow rapidly in size, it is imperative to support efficient partial selection of data and avoid large bulk downloads over IPFS, saving both transfer time and local storage space. Yet IPFS lacks the kind of spatial query functionality that PostGIS provides for a traditional database such as PostgreSQL. This blog post introduces one type of spatial indexing, geohash, and walks through a proof of concept that integrates geohash into the IPFS system. The final section presents preliminary results for this implementation in comparison with solutions using local storage, a local database, and vanilla IPFS without any spatial index.

Geohash#

Geohash is a public domain geocode system invented in 2008 by Gustavo Niemeyer which encodes a geographic location into a short string of letters and digits. Similar ideas were introduced by G.M. Morton in 1966. It is a hierarchical spatial data structure which subdivides space into buckets of grid shape, which is one of the many applications of what is known as a Z-order curve, and generally space-filling curves.

Compared with other prevalent hierarchical spatial data structures such as the R-tree and quadtree, geohash uses an absolute coordinate-to-code mapping, so a location's code does not change with the extent of the dataset.

The 6g cell and its sub-grid

The following code snippet shows basic usage of the Python package pygeohash: encoding a coordinate pair into a geohash and decoding a geohash back to its centroid coordinates.

[ ]:
#!pip install pygeohash (https://github.com/wdm0006/pygeohash)
import pygeohash as pgh

pgh.encode(latitude=42.6, longitude=-5.6)
# >>> 'ezs42e44yx96'

pgh.encode(latitude=42.6, longitude=-5.6, precision=5)
# >>> 'ezs42'

pgh.decode(geohash='ezs42')
# >>> ('42.6', '-5.6')

Proof of concept: Mapping geohash with spatial asset via index folder#

geohash-aux-folder

The most straightforward solution is to create a hierarchical folder structure whose subfolder names represent the geohash sub-grids within the parent grid. Each leaf-node folder corresponds to a geohash at the finest resolution and contains the CIDs that point to the assets located in that leaf geohash; with the layout built below, a point whose precision-6 geohash is dqcjyb, for instance, has its CID listed in index/d/dq/dqc/dqcj/dqcjy/dqcjyb.txt. The entire folder is packed and uploaded to IPFS as a separate asset. The query process then consists of two steps: locate the target geohash node, and recursively collect all the CIDs under it. Using the CIDs from these query steps, a partial retrieval can be achieved. Here is an example with 2600 restaurant points in Washington, DC, downloaded from OpenStreetMap.

[ ]:
##load geojson
import geopandas as gpd
import pandas as pd
import os

load point data and assign geohash to each point#

[ ]:
# load the restaurant points downloaded from OpenStreetMap
demo_points = gpd.read_file(f"../../data/maryland_demo/dc_restaurants.geojson")

# append x/y coordinate columns and assign a precision-6 geohash to each point
demo_points = pd.concat([demo_points, demo_points.get_coordinates()], axis=1)
demo_points['geohash'] = demo_points.apply(lambda row: pgh.encode(row['y'], row['x'], precision=6), axis=1)

“shatter” the points into individual GeoJSON files and calculate CIDs#

NOTE: make sure the IPFS daemon is running before calculating CIDs
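
Optionally, the following cell checks that a daemon is reachable (this check is not part of the original notebook and assumes the Kubo ipfs CLI is installed); it raises CalledProcessError if the daemon is down:

[ ]:
import subprocess
# `ipfs swarm peers` only works in online mode, so it errors out when the daemon is not running
subprocess.check_output(["ipfs", "swarm", "peers"])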

[ ]:
# Directory where the individual GeoJSON files will be saved
asset = "dc_restaurants"
directory = f"../data/geohash_{asset}"
# Make sure the directory exists, if not create it
if not os.path.exists(directory):
    os.makedirs(directory)
# Initialize an empty list to store file paths
file_paths = []
# Loop through each row in GeoDataFrame
for index, row in demo_points.iterrows():
    # Slice the GeoDataFrame to get a single feature (row)
    single_feature_gdf = demo_points.iloc[[index]]

    # Get 'osm_id' for the single feature
    osm_id = row['osm_id']

    # Define the full file path
    file_path = os.path.join(directory, f"{osm_id}.geojson")

    # Save single feature GeoDataFrame as GeoJSON
    single_feature_gdf.to_file(file_path, driver="GeoJSON")

    # Append file_path to list
    file_paths.append(file_path)

# Create a new column in the original GeoDataFrame to store file paths
demo_points['single_path'] = file_paths

[ ]:
# make sure the IPFS daemon is running!
def compute_cid(file_path):
    import subprocess
    # `ipfs add -q` prints only the CID; `-n` (--only-hash) computes the CID
    # without actually adding the file to the local node
    cid = subprocess.check_output(["ipfs", "add", "-qn", file_path]).decode().strip()
    return cid

demo_points['single_cid'] = demo_points.apply(lambda x: compute_cid(x['single_path']), axis=1)

create index folder with a trie data structure#

[ ]:
class TrieNode:
    def __init__(self):
        self.children = {}   # child nodes keyed by a single geohash character
        self.value = []      # CIDs stored at this node (leaf geohash)


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, index, value):
        # walk/create the path of geohash characters and append the CID at the end
        node = self.root
        for char in str(index):
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.value.append(value)

    def get(self, index):
        # return the CID list stored at the node for `index`, or None if the path is absent
        node = self.root
        for char in str(index):
            if char not in node.children:
                return None
            node = node.children[char]
        return node.value

[ ]:
# arrange CIDs by geohash
pairs = list(zip(demo_points['geohash'], demo_points['single_cid']))
# Create an empty trie
trie_dict = Trie()

# Insert each geohash-CID pair into the trie
for index, value in pairs:
    trie_dict.insert(index, value)
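
As a quick sanity check (not in the original notebook), the get method returns the list of CIDs stored under a full-precision geohash, for example the cell of the first point:

[ ]:
# look up the CIDs indexed under the geohash cell of the first point
trie_dict.get(demo_points['geohash'].iloc[0])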

[ ]:
def export_trie(trie_node, geohash, root_path):
    import os
    # paths for the current node: a subfolder for its children and a .txt leaf file for its CIDs
    next_path = root_path + "/" + geohash
    leaf_path = root_path + f"/{geohash}.txt"
    # print(geohash, root_path, next_path, leaf_path)  # uncomment to trace the export
    if trie_node.value:
        # write the CIDs stored at this node, one per line
        with open(leaf_path, 'w') as f:
            for item in trie_node.value:
                f.write(f"{item}\n")
    # make the sub folder and export the children into it
    if trie_node.children and not os.path.exists(next_path):
        os.makedirs(next_path)
    for ch in trie_node.children:
        child_hash = geohash + ch
        export_trie(trie_node.children[ch], child_hash, next_path)

export_trie(trie_dict.root, "", f"../data/geohash_{asset}/index")
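
Both the index folder and the shattered GeoJSON files need to be added to IPFS before the benchmarks below can retrieve them; that step is not shown in this excerpt. A minimal sketch for the index folder, assuming the Kubo CLI and a running daemon:

[ ]:
import subprocess
# add the index folder recursively; -Q prints only the root CID of the folder
index_cid = subprocess.check_output(
    ["ipfs", "add", "-r", "-Q", f"../data/geohash_{asset}/index"]
).decode().strip()
print(index_cid)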

Performance benchmarking: query queen neighbors from a given geohash#

Example query retrieving 329 out of 2675 point features: the goal is to get the nearby restaurants (green dots) within the neighboring geohash cells (blue grids) of a selected standing point (red dot) located in the geohash grid dqcjy.

query_dc

Calculate queen neighbor#
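
The queen_neighbors helper below relies on a rook_neighbors function that is defined elsewhere in the project. A minimal sketch, assuming it returns the four edge-adjacent cells in [top, right, bottom, left] order (which is what the diagonal loop below expects; depending on the full notebook, the centre cell itself may also be included):

[ ]:
import pygeohash as pgh

def rook_neighbors(geohash: str) -> list:
    # edge-adjacent (rook) neighbors of a geohash cell, in [top, right, bottom, left] order
    return [pgh.get_adjacent(geohash, d) for d in ["top", "right", "bottom", "left"]]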

[ ]:
def queen_neighbors(geohash: str) -> list:
    import pygeohash as pgh
    nei = rook_neighbors(geohash)
    directions = ["right","bottom","left","top"]
    for i in range(4):
        nei.append(pgh.get_adjacent(nei[i],directions[i]))
    return nei
q_geohash = "dqcjy"
qn = queen_neighbors(q_geohash)

Query the CID list within the queen neighbor geohashes#

[3]:
# load index structure

## from file structure
def compose_path(s,root):
    """
    compose path a/ab/abc for geohash `abc`
    """
    path = [root]
    for i in range(len(s)):
        path.append(s[:i+1])
    return "/".join(path)

def process_leaf_node(leaf):
    """
    process an index leaf.
    leaf: txt file path of an index leaf, like a/ab/abc.txt
    """
    with open(leaf, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    return [line.strip() for line in lines]

def traverse_sub_node(node):
    """
    recursively collect all the leaf node under the current node
    """
    import os

    results=[]
    excludes = [".ipynb_checkpoints"]
    # Get list of items in the directory
    subfolders = [d for d in os.listdir(node) if os.path.isdir(os.path.join(node, d)) and d not in excludes]
    # If there are subfolders, traverse them
    if subfolders:
        for subfolder in subfolders:
            results.extend(traverse_sub_node(os.path.join(node, subfolder)))
    else:
        # Otherwise, process txt files in the directory
        txt_files = [f for f in os.listdir(node) if f.endswith('.txt')]
        for txt_file in txt_files:
            results.extend(process_leaf_node(os.path.join(node, txt_file)))
    return results

def query_feature_cid_by_geohash(geohash: str, index_root: str) -> list:
    """
    find the matching geohash node or its sub-level hashes
    """
    import os
    target_path = compose_path(geohash,index_root)
    cid_list = []
    if os.path.exists(target_path):
        cid_list = traverse_sub_node(target_path)
    if os.path.exists(target_path+'.txt'):
        cid_list = process_leaf_node(target_path+'.txt')
    return cid_list

def ipfs_get_feature(cid):
    import subprocess
    subprocess.check_output(["ipfs", "get", cid])
    return gpd.read_file(f"./{cid}")
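
The benchmarks below also call a multi_geohash_query helper that is not shown in this excerpt; a minimal sketch, assuming it simply unions the CID lists of several geohashes via query_feature_cid_by_geohash:

[ ]:
def multi_geohash_query(geohashes: list, index_root: str) -> list:
    # collect the CIDs for every geohash in the list
    cids = []
    for gh in geohashes:
        cids.extend(query_feature_cid_by_geohash(gh, index_root))
    return cids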

Benchmark for IPFS + geohash#

[ ]:
%%timeit

import subprocess
# fetch the index folder from IPFS into a local test directory
subprocess.check_output(["ipfs", "get", "-o", "../data/test/dc", "QmPaXWva3WQR2uwFdu6bkizUyRyKpX3V1aiT5dvnnYKSpJ"])
# query the CIDs within the queen neighbor geohashes
results = multi_geohash_query(qn, "../data/test/dc")
os.chdir("../data/test/")
# IPFS retrieval of the individual features
ipfs_retrieval = pd.concat([ipfs_get_feature(cid) for cid in results])
os.chdir("../../notebooks")

Benchmark for local I/O + geohash#

[ ]:
%%timeit
# query the CIDs within the queen neighbor geohashes using the local index folder
results = multi_geohash_query(qn, f"../data/geohash_{asset}/index")
# local retrieval of the matching individual GeoJSON files
local_io_retrieval = pd.concat([gpd.read_file(path) for path in demo_points[demo_points.single_cid.isin(results)].single_path.tolist()])

Benchmark for geopandas query#

[ ]:
%%timeit
#candidate_points = gpd.read_file(f"../../data/maryland_demo/{asset}_cid.geojson")
gpd.sjoin(candidate_points, neighbors, how="inner", predicate="within")
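
Both this cell and the PostGIS benchmark below use a neighbors GeoDataFrame (and a candidate_points GeoDataFrame of all features) prepared outside this excerpt. A minimal sketch for neighbors, assuming it holds the bounding-box polygons of the queen-neighbor geohash cells (pygeohash's decode_exactly returns the cell centroid plus its half-width errors):

[ ]:
import geopandas as gpd
import pygeohash as pgh
from shapely.geometry import box

def geohash_to_polygon(gh: str):
    # bounding box of a geohash cell, built from its centroid and half-width errors
    lat, lon, lat_err, lon_err = pgh.decode_exactly(gh)
    return box(lon - lon_err, lat - lat_err, lon + lon_err, lat + lat_err)

neighbors = gpd.GeoDataFrame({"geohash": qn}, geometry=[geohash_to_polygon(gh) for gh in qn], crs="EPSG:4326")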

Benchmark for PostgreSQL#

Run the Docker daemon, then spin up PostgreSQL with docker-compose up.
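
A docker-compose.yml is assumed to live in the project repository; a minimal illustrative equivalent matching the connection settings used below (database geodb, user user, password password, port 5432) could look like:

# docker-compose.yml (illustrative)
services:
  db:
    image: postgis/postgis
    environment:
      POSTGRES_DB: geodb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    ports:
      - "5432:5432"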

[ ]:
import geopandas as gpd
from sqlalchemy import create_engine
import psycopg2

# Load GeoJSON into a GeoDataFrame
gdf = gpd.read_file(f"../../data/maryland_demo/{asset}.geojson")

# Connect to PostgreSQL
engine = create_engine('postgresql://user:password@localhost:5432/geodb')

# Load data into PostgreSQL
gdf.to_postgis(f'{asset}', engine, if_exists='replace', index=False)

[ ]:
##Create spatial index
# Parameters for connection
params = {
    'dbname': 'geodb',
    'user': 'user',
    'password': 'password',
    'host': 'localhost',
    'port': '5432'
}

# Create a connection and cursor
conn = psycopg2.connect(**params)
cur = conn.cursor()

# Execute the CREATE INDEX command
cur.execute(f'CREATE INDEX ON {asset} USING gist(geometry)')

# Commit the changes and close the connection
conn.commit()
cur.close()
conn.close()

[ ]:
# store query target geodataframe into database
neighbors.crs = "EPSG:4326"
neighbors.to_postgis("temp_table", engine, if_exists="replace")

[ ]:
%%timeit
# benchmark the PostGIS spatial query
conn = psycopg2.connect(**params)
# Create a new cursor
cur = conn.cursor()

# Execute a COUNT SQL
sql = f"""
SELECT COUNT(*) FROM

(SELECT {asset}.geometry
FROM {asset}, temp_table
WHERE ST_Intersects({asset}.geometry, temp_table.geometry)) AS R;
"""
cur.execute(sql)

# Fetch the result
count = cur.fetchone()[0]

# Close the cursor and connection
cur.close()
conn.close()

Benchmark comparison#

Times are the per-loop values reported by %%timeit (7 runs, 10 loops each).

| Query method for queen neighbor proximity | Query for nearby features | Retrieval of nearby features |
| --- | --- | --- |
| Geohash + local I/O | 0.77 ms | 3190 ms |
| Geohash + IPFS | 257 ms | 13500 ms |
| In-memory query (geopandas.sjoin) | 20 ms | 14600 ms |
| PostGIS | 21.3 ms | 18.8 ms |

The local I/O results show that a pre-computed geohash index can significantly reduce the query time for this task. However, the online IPFS test indicates that the I/O cost is still far from optimal when the index file has to be retrieved over IPFS. Once file retrieval is included, partial retrieval with geohash outperforms the in-memory query, where no geohash index is available and all the assets have to be loaded into memory. Yet the local database remains the best option when the retrieval step is taken into account, which implies there is room for improvement in the file exchange process within IPFS.

Try yourself#

To replicate the code locally, you may check out the project repository for environment setup.