I _Really_ Don't Know

A low-frequency blog by Rob Styles

Big Data, Large Batches and My Mistake

This is week 9 for me in my new challenge at Callcredit. I wrote a bit about what we’re doing last time and can’t write much about the detail right now as the product we’re building is secret. Credit bureaus are a secretive bunch, culturally. Probably not a bad thing given what they know about us all.

Don’t expect a Linked Data tool or product. What we’re building is firmly in Callcredit’s existing domain.

As well as the new job, I’ve been reading Eric Ries’ The Lean Startup, tracking Big Data news and developing this app. This weekend the combination of these things became a perfect storm that led me to a D’Oh! moment.

One of the many key points in Lean Startup is to maximise learning by getting stuff out as quickly as possible. The main aspect of getting stuff out is to work in small batches. There are strong parallels here with Agile development practices and the need to get a single end-to-end piece of functionality working quickly.

This GigaOm piece on Hadoop’s days being numbered describes the need for faster, smaller batches too, in the context of data analysis responses and incremental changes to data. It introduces a number of tools, some of which I’ve looked at and some I haven’t.

The essence of moving to tools like Percolator, Dremel and Giraph is to reduce the time to learning; to shorten the time it takes to get round the data processing loop.

So, knowing all of this, why have I been working in large batches? I’ve spent the last few weeks building out quite detailed data conversions, but without a UI on the front to get any value from them! Why, given everything I know and all that I’ve experienced, didn’t I build a narrow end-to-end system that could be quickly broadened out?

A mixture of reasons, none of which are really valid, just tricks of the mind.

Yesterday I started to fix this and built a small-batch, end-to-end run that I can release soon for internal review.