[Question] Simple CSV Row Consolidation

I have to apologize up front, since my title is a bit misleading. The task is simple, but the scale of the data is the problem. I have just over half a million records in the following format:

referenceNumber response
00001 A
00001 B
00001 C
00002 A
00002 C
00003 D
00004 A
00004 B
00004 C
00004 E

The goal is to have the data formatted like so,

referenceNumber response
00001 A;B;C
00002 A;C
00003 D
00004 A;B;C;E

I have a tiny 12 line script which accomplishes this quite easily – but it does not work at this scale.


A sample of 250 records processes almost instantly. A sample of 25,000 records takes 30 seconds to load, and then about 20 seconds to process.

I ran this for the entire data set of ~500,000+ records. It has been (as of now) 2 Hours and 40 Minutes – and it has not even finished loading. Powershell makes up for 11 – 15% of my CPU usage and has slowly climbed from ~250MB of memory utilization to currently 1.2GB of memory and is still growing.

I have a feeling that this is due to the way in which the script works, which is to first group everything together by referenceNumber and then write that information to a PSObject – which is what I think is the reason the memory usage has grown so large. The final part should be trivial, which is to just write the object out as a CSV.

My question is, does anyone know how to write this in a way so that the first record is compared to all other records and then written to CSV, and then repeats this process for all remaining records? This way I can pause/play the script and use the data while the rest is processing.

Based on my 25,000 record test, I estimated that it should have only taken ~20 minutes to complete all the records if the time to complete was linear. But it has become obvious that the size of the file dramatically impacts the time to complete in a non-linear way.

Comments would be appreciated.

submitted by /u/TypicalTim
[link] [comments]

Leave a Reply