how does azure sql dw distribution work?

My understanding is that storage and compute are decoupled on Azure SQL DW. On other MPP databases, storage and compute are located on the same server nodes, and distribution works by hashing: Amazon Redshift is one example; IBM Netezza (PureData for Analytics) is another. On both systems, storage and compute live on the same node, and data is spread across the n nodes of the system either randomly or based on a designated hash key, say something like customer ID.
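For concreteness, this is what declaring the hash key looks like on Redshift (a sketch; the table and column names are just the customer-ID example from above, not from any real schema):

```sql
-- Amazon Redshift: rows are hashed on the DISTKEY column, and each row is
-- stored on the compute node (slice) that owns its hash range. Rows sharing
-- a customer_id land on the same node, so joins on that key stay local.
CREATE TABLE transactions (
    transaction_id BIGINT,
    customer_id    BIGINT DISTKEY,
    amount         DECIMAL(12,2)
);
```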

I’m struggling to understand this concept on Azure SQL DW, since compute and storage are no longer linked. However, per the CREATE TABLE DDL – – one can see in the definition that the distribution clause is supported; in fact there are comments on the differing defaults between PDW and Azure SQL DW. There is another post with more detail about this clause here: – I’d just like to confirm that I’m understanding this correctly. It seems to have nothing to do with collocating data on the same compute and storage (which is what distribution does to improve joins on Redshift/Netezza), but is instead some kind of run-time partitioning of compute resources? Which I only benefit from when it is synchronized between two tables, i.e. a customer dimension and a transaction fact both distributed on (customer_id)? If not, does the data need to be rebalanced across the compute nodes to get it collocated for a join? Does this also mean the max size for an Azure DW at present is 60 nodes?
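To make the "synchronized between two tables" scenario concrete, here is a sketch of the Azure SQL DW DDL (table and column names are illustrative, carried over from the customer_id example; the DISTRIBUTION = HASH syntax is from the CREATE TABLE DDL referenced above):

```sql
-- Azure SQL DW: each row is assigned to one of the 60 distributions by a
-- hash of customer_id. Distributing both tables on the same key keeps
-- matching rows in the same distribution.
CREATE TABLE dim_customer (
    customer_id   BIGINT NOT NULL,
    customer_name NVARCHAR(200)
)
WITH ( DISTRIBUTION = HASH(customer_id) );

CREATE TABLE fact_transaction (
    transaction_id BIGINT NOT NULL,
    customer_id    BIGINT NOT NULL,
    amount         DECIMAL(12,2)
)
WITH ( DISTRIBUTION = HASH(customer_id) );

-- Distribution-aligned join: because both sides are hashed on customer_id,
-- the engine should not need a data-movement (shuffle) step in the plan.
SELECT c.customer_name, SUM(f.amount) AS total
FROM fact_transaction AS f
JOIN dim_customer     AS c ON c.customer_id = f.customer_id
GROUP BY c.customer_name;
```

If the two tables were distributed on different keys (or round-robin), the engine would have to move rows between distributions at query time to collocate them for the join.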

Edit: I do see it nicely in effect here: and per I do see the maximum number of compute nodes is 60. I also see that the data is sharded at the file-system level. So instead of a data partition, this seems like a new option to set a ‘compute partition’.
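A way to observe this sharding directly (a sketch, assuming the illustrative fact_transaction table from earlier): every table is always split into 60 distributions in storage, and scaling only changes how those fixed distributions are mapped onto compute nodes.

```sql
-- Reports space and row counts per distribution for a table:
-- expect 60 rows back, one per distribution, regardless of DWU level.
DBCC PDW_SHOWSPACEUSED('fact_transaction');

-- Shows the current mapping of the 60 distributions to compute nodes;
-- this mapping changes when the warehouse is scaled up or down.
SELECT * FROM sys.pdw_distributions;
```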

submitted by /u/igobyplane_com