Exploring Out-of-Core Computing with fdict in Python

Blake Bradford

Managing large and complex datasets efficiently is crucial in data processing and analysis, and traditional in-memory computing can be inadequate for massive volumes of data. Out-of-core computing, which processes data that is too large to fit into memory, addresses this problem. In this article, we will explore fdict, a Python library that enables easy out-of-core computing with recursive data structures.

What is fdict?

fdict is a drop-in replacement for the standard dict in Python, designed for nested and recursive data structures. You initialize it by passing a nested dict to the fdict() constructor, which stores the data internally as a flattened representation. This makes large and complex data easier to work with.

Getting Started with fdict

To get started with fdict, you can simply import the library and use the fdict() constructor to create an instance of the flattened dict:

from fdict import fdict

d = fdict({'a': {'b': 1, 'c': [2, 3]}, 'd': 4})

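In the example above, we create an fdict from a nested dict. Internally, each nested key is flattened into a single string joined by a delimiter ('/' in the output below). The resulting fdict will have the following structure (the order of the keys is not significant):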

{
'a/c': [2, 3],
'd': 4,
'a/b': 1
}
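
To make the flattening concrete, here is a minimal pure-Python sketch of the idea. It is not fdict's actual implementation: the helper name `flatten` and its `delimiter` and `prefix` parameters are hypothetical and chosen only for illustration.

```python
def flatten(nested, delimiter="/", prefix=""):
    """Return a flat dict whose keys are the delimiter-joined paths to each leaf."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}{delimiter}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying along the path built so far.
            flat.update(flatten(value, delimiter, path))
        else:
            # Leaves (anything that is not a dict, including lists) are stored as-is.
            flat[path] = value
    return flat


flat = flatten({'a': {'b': 1, 'c': [2, 3]}, 'd': 4})
print(flat)  # {'a/b': 1, 'a/c': [2, 3], 'd': 4}
```

Only the leaves end up as values, while the nesting itself survives purely in the key strings.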

Working with Nested Dicts

fdict provides convenient methods for working with nested dicts. You can add new items to the fdict using nested keys, and fdict will automatically convert nested dicts into the flattened representation:

d['e'] = {'f': {'g': {'h': 5}}}

The above code adds a nested dict to the fdict. The resulting fdict will have the following structure:

{
'e/f/g/h': 5,
'a/c': [2, 3],
'd': 4,
'a/b': 1
}
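
The conversion on assignment can be pictured with a toy dict subclass. This is only a sketch of the behaviour described above, under the assumption that nested dicts are expanded key by key; `FlatDict` is a hypothetical name, not part of fdict.

```python
class FlatDict(dict):
    """Toy dict subclass that flattens nested dicts on assignment (not fdict itself)."""

    delimiter = "/"

    def __setitem__(self, key, value):
        if isinstance(value, dict):
            # Store each nested item under "<key>/<child>" instead of the dict itself.
            for child_key, child_value in value.items():
                self[f"{key}{self.delimiter}{child_key}"] = child_value
        else:
            super().__setitem__(key, value)


d = FlatDict()
d['d'] = 4
d['e'] = {'f': {'g': {'h': 5}}}
print(d)  # {'d': 4, 'e/f/g/h': 5}
```

Each level of nesting adds one path segment to the key, so the four-level dict becomes the single entry 'e/f/g/h'.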

You can also convert an fdict back to a standard dict using the to_dict_nested() method:

nested_dict = d.to_dict_nested()
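
Conceptually, going back to a nested dict means splitting each flattened key on the delimiter and rebuilding the intermediate levels. The sketch below illustrates that idea in plain Python. The `unflatten` helper is hypothetical and is not how to_dict_nested() is necessarily implemented.

```python
def unflatten(flat, delimiter="/"):
    """Rebuild a nested dict from a flat dict with delimiter-joined keys."""
    nested = {}
    for path, value in flat.items():
        *parents, leaf = path.split(delimiter)
        node = nested
        for part in parents:
            # Create each intermediate level on first use, then descend into it.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


flat = {'e/f/g/h': 5, 'a/c': [2, 3], 'd': 4, 'a/b': 1}
nested = unflatten(flat)
print(nested)
# {'e': {'f': {'g': {'h': 5}}}, 'a': {'c': [2, 3], 'b': 1}, 'd': 4}
```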

Out-of-Core Computing with sfdict

To store an fdict on disk for out-of-core computing, you can use sfdict, a subclass of fdict that uses the standard library module shelve internally. Here’s an example of using sfdict to store an fdict on disk:

from fdict import sfdict

d = sfdict(filename='myshelf.db')
d['a'] = {'b': 1, 'c': [2, 3]}
d.sync() # synchronize all changes back to disk
d.close() # always close the database

d2 = sfdict(filename='myshelf.db')
print(d2)
d2.close()

In the example above, we initialize an empty database using the sfdict constructor with a filename. We can then make changes to the database, sync the changes back to disk, and close the database. To reopen the database, we can create a new instance of sfdict with the same filename and access the data stored on disk.
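
Since sfdict is described as building on shelve, it can help to see the same write, sync, close, and reopen cycle with shelve alone. The sketch below uses only the standard library. Storing flattened path strings as shelf keys is an assumption made for illustration, and the temporary path is arbitrary.

```python
import os
import shelve
import tempfile

# A temporary directory keeps the example self-contained; the path is illustrative.
path = os.path.join(tempfile.mkdtemp(), "myshelf")

# Write: keys are flattened path strings, values are pickled by shelve.
with shelve.open(path) as db:
    db["a/b"] = 1
    db["a/c"] = [2, 3]
    db.sync()  # flush pending writes; leaving the with-block also closes the shelf

# Reopen: the data is read back from disk rather than kept in memory.
with shelve.open(path) as db:
    restored = dict(db)

print(restored)
```

Using the shelf as a context manager guarantees it is closed even if an error occurs, which plays the same role as the explicit close() calls above.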

Performance Considerations

While fdict makes nested and recursive data structures easy to work with, it has performance costs. fdict is generally slower than the standard dict for most operations because of class overhead and key string manipulation. Operations on leaves (non-dict objects) nevertheless keep the O(1) complexity of dict, since each leaf is stored under a single flattened key.

Operations on nodes (nested dict objects) in fdict have a performance cost, as accessing direct children requires walking through all keys and filtering out the ones that do not match the current branch. This results in O(n) complexity, where n is the total number of items in the fdict. To mitigate this performance cost, fdict provides the extract() method to filter keys and build a new fdict with only the pertinent nested items. Additionally, the fastview=True argument can be used to enable FastView mode, which stores lookup tables for direct children access and improves the performance of operations on nodes.
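
The difference between the two access patterns can be sketched in plain Python. The code below is an illustration of the trade-off, not fdict's internals: `children_by_scan` and `build_child_index` are hypothetical helpers, and the index is only a guess at the kind of lookup table that FastView mode might keep.

```python
from collections import defaultdict

flat = {'a/b': 1, 'a/c': [2, 3], 'd': 4, 'e/f/g/h': 5}


def children_by_scan(flat, node, delimiter="/"):
    """O(n): visit every key and keep the ones that live under `node`."""
    prefix = node + delimiter
    return sorted({key[len(prefix):].split(delimiter)[0]
                   for key in flat if key.startswith(prefix)})


def build_child_index(flat, delimiter="/"):
    """Precompute parent -> direct children once, so later lookups skip the scan."""
    index = defaultdict(set)
    for key in flat:
        parts = key.split(delimiter)
        for depth in range(1, len(parts)):
            index[delimiter.join(parts[:depth])].add(parts[depth])
    return index


index = build_child_index(flat)
print(flat['a/b'])                  # leaf access: one hash lookup -> 1
print(children_by_scan(flat, 'a'))  # node access by scanning -> ['b', 'c']
print(sorted(index['a']))           # node access via the table -> ['b', 'c']
```

The lookup table trades extra memory and bookkeeping on every write for faster node access, which is the usual cost of this kind of index.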

Conclusion

fdict provides a convenient way to work with nested and recursive data structures in Python through a dict-like interface. Its sfdict subclass enables out-of-core computing by storing the data on disk with the shelve module, so datasets that exceed the available memory can still be handled. fdict is slower than the standard dict for most operations, but it is useful for prototyping big data applications and can be replaced with a faster datatype when necessary.

For more information, consult the fdict documentation on the GitHub repository.
