Provenance: Profiling and Comparing Overheads to Enhance Data Provenance on Large Datasets
As a graduate student researcher in user-centric data systems, I was tasked with profiling the performance overheads of query tracing and data provenance on large datasets and identifying their bottlenecks.
- Built the end-to-end platform for the experimental analysis
- Built a Python library to measure the impact of query tracing and data provenance on large datasets (a minimal overhead-measurement sketch follows this list)
- Instrumented SQL query operators with Jaeger to pinpoint bottlenecks (see the tracing sketch below)
- After determining the trade-offs, sped up SQL operations by batching and parallelizing non-blocking operators with Ray actors (sketched after this list)
- Because this problem sits at the bleeding edge, prior research is sparse; most of the development time went into finding the right combination of libraries, which meant working through documentation, blog posts, and, above all, hands-on code experiments
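To give a flavor of the overhead measurement, here is a minimal sketch of the approach: run the same query with tracing on and off and compare median latencies. The table, query, and sqlite3 trace callback are stand-ins for illustration only; the actual library targeted much larger datasets and richer provenance capture.

```python
import sqlite3
import statistics
import time

def run_query(conn, sql, trace=False):
    # sqlite3's trace callback is a cheap stand-in for a real tracer;
    # the project wrapped query operators in Jaeger spans instead.
    conn.set_trace_callback((lambda stmt: None) if trace else None)
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    return time.perf_counter() - start

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

sql = "SELECT x % 10, COUNT(*) FROM t GROUP BY x % 10"
baseline = [run_query(conn, sql) for _ in range(20)]
traced = [run_query(conn, sql, trace=True) for _ in range(20)]
overhead = statistics.median(traced) / statistics.median(baseline) - 1
print(f"median tracing overhead: {overhead:+.1%}")
```

Repeating each configuration and comparing medians rather than single runs keeps warm-up effects and scheduling noise from dominating the comparison.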
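The per-operator bottleneck analysis relied on spans visible in the Jaeger UI. The sketch below shows one way to do this with the OpenTelemetry Python SDK and its Jaeger (thrift) exporter; the service name, the toy scan/filter operators, and the agent address are assumptions, not the project's actual instrumentation.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Export spans to a local Jaeger agent; the service name is a stand-in.
provider = TracerProvider(
    resource=Resource.create({"service.name": "provenance-profiler"}))
provider.add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="localhost",
                                      agent_port=6831)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def scan(rows):
    # Each logical operator gets its own span, so per-operator latency
    # shows up directly in the Jaeger trace view.
    with tracer.start_as_current_span("scan"):
        return list(rows)

def filter_op(rows, pred):
    with tracer.start_as_current_span("filter"):
        return [r for r in rows if pred(r)]

with tracer.start_as_current_span("query"):
    out = filter_op(scan(range(1_000_000)), lambda r: r % 7 == 0)
print(len(out))
```

Wrapping each operator in its own child span makes the slowest stage of a query plan immediately visible as the widest bar in the trace timeline.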
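Finally, a minimal sketch of the Ray-based speedup. Non-blocking operators such as filters and projections process rows independently, so their input can be split into batches and fanned out across actors. The `OperatorWorker` class, batch size, and the doubled-even-rows operator are illustrative placeholders, not the project's real operators.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class OperatorWorker:
    """Stateful actor that applies one non-blocking operator to a batch."""

    def apply(self, batch):
        # Stand-in operator: a filter followed by a projection.
        return [row * 2 for row in batch if row % 2 == 0]

def parallel_apply(rows, num_workers=4, batch_size=10_000):
    workers = [OperatorWorker.remote() for _ in range(num_workers)]
    futures = []
    for i, start in enumerate(range(0, len(rows), batch_size)):
        batch = rows[start:start + batch_size]
        # Round-robin batches across actors; calls return immediately.
        futures.append(workers[i % num_workers].apply.remote(batch))
    # ray.get preserves submission order, so batch order is stable.
    results = ray.get(futures)
    return [row for batch in results for row in batch]

print(len(parallel_apply(list(range(100_000)))))
```

Because the operator is non-blocking, batches never wait on one another, so throughput scales with the number of actors until the cluster's cores are saturated.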
Outcome: a clear picture of the performance overheads involved in achieving data provenance, and where they can be reduced.