Saturday, September 25, 2010

FPGA for data analytics

On September 20, IBM announced the acquisition of Netezza, a data warehouse appliance company. Netezza's appliance approach to high-performance analytics is made possible by a commodity chip technology called the FPGA. I did not know FPGAs were used in data analytics and query processing until this Thursday, when I was talking to one of my colleagues and mentors at work. That made me curious about the architecture of the Netezza appliance and the use of FPGAs in general.
What is an FPGA, and why is it popular for streaming large data sets?
FPGA stands for Field Programmable Gate Array: the chip can be programmed by the customer after manufacturing. It contains programmable logic blocks capable of performing combinational logic.
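To make "programmable logic block" a little more concrete, here is a minimal Python sketch. It assumes the common textbook simplification that a logic block behaves like a small lookup table (LUT): the truth table is the "program", and loading a different table turns the same block into a different gate. The names `make_lut`, `xor2` and `and2` are made up for illustration and have nothing to do with Netezza or any FPGA vendor's tools.

```python
def make_lut(truth_table):
    """Return a combinational function backed by a 2**n-entry truth table.

    Hypothetical helper: models one programmable logic block as a lookup
    table. "Programming" the block just means choosing truth_table.
    """
    if not truth_table:
        raise ValueError("truth table must not be empty")
    n = len(truth_table).bit_length() - 1  # number of input bits
    if len(truth_table) != 1 << n:
        raise ValueError("truth table length must be a power of two")

    def lut(*bits):
        if len(bits) != n:
            raise ValueError(f"expected {n} input bits, got {len(bits)}")
        # Pack the input bits into an index, first argument = highest bit.
        index = 0
        for b in bits:
            index = (index << 1) | (1 if b else 0)
        return truth_table[index]

    return lut


# The same kind of block, "programmed" two different ways.
xor2 = make_lut([0, 1, 1, 0])  # outputs for inputs 00, 01, 10, 11
and2 = make_lut([0, 0, 0, 1])
```

Calling `xor2(1, 0)` looks up entry 2 of the first table, and `and2(1, 1)` looks up entry 3 of the second. A real FPGA wires many such blocks together, which this sketch does not attempt to model.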
High-performance business analytics means operating on very large data sets, and traditional warehouse systems struggle to move that data from disk over the network with low latency. The Netezza appliance uses FPGAs to filter out extraneous data at the source, so that it never has to be moved off the disk. This spares the CPU, memory and network from handling data that the query does not need, boosting performance 10 to 100 times compared to a traditional system. The key building blocks of a Netezza appliance include:

  • Netezza host - This is a Linux SMP server that presents standard tools and configuration to the user. Its software layer compiles SQL queries into executable code snippets, creates optimized query plans and submits those snippets to the MPP nodes for execution.
  • S-blades - These are high-performance blade servers with multi-core CPUs and FPGAs. The programmable FAST engine, which does the magic, resides on the FPGA. As the picture on the right shows, the FAST engine reads compressed data by direct memory access, uncompresses it and passes it to the project and restrict engines, which filter out columns and rows respectively, based on the SELECT and WHERE clauses of the SQL query. The rows that survive are a very small percentage of the original data. This data is then handed back to memory for processing by the CPU cores.
Query results are sent over a customized fast network and aggregated by the host for presentation to users.
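The decompress, project and restrict steps described above can be sketched in a few lines of Python. This is only a software illustration of the data flow, not Netezza's implementation: the function names, the use of zlib and JSON as a stand-in for the on-disk format, and the sample table are all assumptions of mine. For simplicity the sketch applies the row filter before dropping columns, so that the WHERE condition can still see columns that are not in the SELECT list.

```python
import json
import zlib


def compress_rows(rows):
    """Hypothetical stand-in for a compressed disk block (JSON + zlib)."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def fast_engine_sketch(compressed_block, select_columns, where):
    """Toy model of the flow: uncompress -> restrict rows -> project columns.

    select_columns plays the role of the SELECT list, and where is a
    predicate playing the role of the WHERE clause. Both are assumptions
    made for illustration.
    """
    rows = json.loads(zlib.decompress(compressed_block).decode("utf-8"))
    for row in rows:
        if where(row):  # restrict: drop rows that fail the WHERE condition
            # project: keep only the columns named in the SELECT list
            yield {col: row[col] for col in select_columns}


# Made-up sample data: SELECT id, amount WHERE region = 'EU' AND amount > 20
sample_rows = [
    {"id": 1, "region": "EU", "amount": 10},
    {"id": 2, "region": "US", "amount": 25},
    {"id": 3, "region": "EU", "amount": 40},
]
block = compress_rows(sample_rows)
filtered = list(
    fast_engine_sketch(
        block,
        ["id", "amount"],
        lambda r: r["region"] == "EU" and r["amount"] > 20,
    )
)
```

Only the third row survives the filter here, and only its `id` and `amount` columns are passed on. That is the point of the design: whatever sits downstream (the CPU cores, in Netezza's case) sees a small fraction of the data that was read from disk.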

1 comment:

  1. Hi,

    Nicely written article. Could you shed some more light on how exactly the FPGA helps improve performance?
