Provenance: Profiling and Comparing Overheads to Enhance Data Provenance on Large Datasets
As a graduate student researcher in user-centric data systems, I was tasked with profiling the performance overheads of query tracing and data provenance on large datasets and identifying their bottlenecks.
- Built the end-to-end platform for the experimental analysis
- Built a Python library to measure the impact of query tracing and data provenance on large datasets (a minimal overhead-measurement sketch follows this list)
- Instrumented SQL query operators with Jaeger to pinpoint bottlenecks (see the tracing sketch below)
- After determining the trade-offs, sped up SQL operations by batching and parallelizing non-blocking operators with Ray actors (sketched after this list)
- Because this problem sits at the bleeding edge, prior research is sparse; most of the development time went into finding the right combination of libraries, which meant working through documentation, blog posts, and, above all, hands-on code experiments
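To give a flavor of the overhead measurement, here is a minimal sketch of the approach: run the same query with tracing on and off and compare median latencies. The table, query, and sqlite3 trace callback are stand-ins for illustration only; the actual library targeted much larger datasets and richer provenance capture.

```python
import sqlite3
import statistics
import time

def run_query(conn, sql, trace=False):
    # sqlite3's trace callback is a cheap stand-in for a real tracer;
    # the project wrapped query operators in Jaeger spans instead.
    conn.set_trace_callback((lambda stmt: None) if trace else None)
    start = time.perf_counter()
    conn.execute(sql).fetchall()
    return time.perf_counter() - start

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

sql = "SELECT x % 10, COUNT(*) FROM t GROUP BY x % 10"
baseline = [run_query(conn, sql) for _ in range(20)]
traced = [run_query(conn, sql, trace=True) for _ in range(20)]
overhead = statistics.median(traced) / statistics.median(baseline) - 1
print(f"median tracing overhead: {overhead:+.1%}")
```

Repeating each configuration and comparing medians rather than single runs keeps warm-up effects and scheduling noise from dominating the comparison.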
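The per-operator bottleneck analysis relied on spans visible in the Jaeger UI. The sketch below shows one way to do this with the OpenTelemetry Python SDK and its Jaeger (thrift) exporter; the service name, the toy scan/filter operators, and the agent address are assumptions, not the project's actual instrumentation.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Export spans to a local Jaeger agent; the service name is a stand-in.
provider = TracerProvider(
    resource=Resource.create({"service.name": "provenance-profiler"}))
provider.add_span_processor(
    BatchSpanProcessor(JaegerExporter(agent_host_name="localhost",
                                      agent_port=6831)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def scan(rows):
    # Each logical operator gets its own span, so per-operator latency
    # shows up directly in the Jaeger trace view.
    with tracer.start_as_current_span("scan"):
        return list(rows)

def filter_op(rows, pred):
    with tracer.start_as_current_span("filter"):
        return [r for r in rows if pred(r)]

with tracer.start_as_current_span("query"):
    out = filter_op(scan(range(1_000_000)), lambda r: r % 7 == 0)
print(len(out))
```

Wrapping each operator in its own child span makes the slowest stage of a query plan immediately visible as the widest bar in the trace timeline.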
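Finally, a minimal sketch of the Ray-based speedup. Non-blocking operators such as filters and projections process rows independently, so their input can be split into batches and fanned out across actors. The `OperatorWorker` class, batch size, and the doubled-even-rows operator are illustrative placeholders, not the project's real operators.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class OperatorWorker:
    """Stateful actor that applies one non-blocking operator to a batch."""

    def apply(self, batch):
        # Stand-in operator: a filter followed by a projection.
        return [row * 2 for row in batch if row % 2 == 0]

def parallel_apply(rows, num_workers=4, batch_size=10_000):
    workers = [OperatorWorker.remote() for _ in range(num_workers)]
    futures = []
    for i, start in enumerate(range(0, len(rows), batch_size)):
        batch = rows[start:start + batch_size]
        # Round-robin batches across actors; calls return immediately.
        futures.append(workers[i % num_workers].apply.remote(batch))
    # ray.get preserves submission order, so batch order is stable.
    results = ray.get(futures)
    return [row for batch in results for row in batch]

print(len(parallel_apply(list(range(100_000)))))
```

Because the operator is non-blocking, batches never wait on one another, so throughput scales with the number of actors until the cluster's cores are saturated.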
Outcome: a clear picture of the performance overheads involved in achieving data provenance, and where they can be reduced.