Hasnain Reads

Hasnain says:

Now this is some really impressive work, taking costs from $1.8M/yr to $10k/yr for log storage. I liked how it was an iterative process, massaging and moving around data until it could be compressed much better. Reminds me of some work we did back in the day to split up data a little for better compression. The wins are huge!

“We have deployed Phase 1 (i.e., the custom Log4j appender with our custom float encoding) across our entire Spark platform. We are currently working on deploying the Phase 2 compression and integrating CLP’s search capability into our analytics and observability platforms.

Result of Phase 1 compression: In a 30-day window, our entire Spark ecosystem generated 5.38PB of uncompressed INFO level unstructured logs yet our CLP appender compressed them to only 31.4TB, amounting to an unprecedented 169x compression ratio. Now with CLP, we have restored our log verbosity from WARN back to INFO, and we can afford to retain all the logs for 1 month (as requested by our engineers).

Preliminary result of Phase 2 compression: The above mentioned result is only the size of the compressed IR. We have tested a prototype of CLP’s complete compression (including both Phase 1 and 2) on a subset of our Spark logs, and CLP’s compression ratio is 2.16x higher than Zstandard’s ratio and 2.28x higher than Gzip’s ratio. This is consistent with the results reported on other log datasets.”
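
As a quick sanity check on the Phase 1 numbers quoted above, here is a minimal sketch that recomputes the compression ratio from the two published sizes. The decimal unit convention (1 PB = 1000 TB) and the helper name `compression_ratio` are my own assumptions, not something the article specifies. With the rounded figures this gives about 171x, in the same ballpark as the reported 169x; I'd guess the small gap comes from rounding or a different unit convention, but the quote doesn't say.

```python
# Recompute the Phase 1 compression ratio from the sizes quoted above.
# Assumption: decimal units (1 PB = 1000 TB); the article does not state this.
TB_PER_PB = 1000


def compression_ratio(uncompressed_tb: float, compressed_tb: float) -> float:
    """Return uncompressed size divided by compressed size (hypothetical helper)."""
    if compressed_tb <= 0:
        raise ValueError("compressed size must be positive")
    return uncompressed_tb / compressed_tb


uncompressed_tb = 5.38 * TB_PER_PB  # 5.38 PB of raw INFO-level logs over 30 days
compressed_tb = 31.4                # size after the CLP appender

ratio = compression_ratio(uncompressed_tb, compressed_tb)
print(f"{ratio:.0f}x")  # prints "171x"
```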

Posted on 2022-10-01T16:19:03+0000