Today I came up with a task for myself, mostly for fun, but it also has a useful application. I have a dump of a MySQL table in CSV format, roughly 500K records. The data will be loaded into Neo4j, and I want to speed up the load and be able to parallelize the process. Of course, I could do this in any programming language, but I decided to practice with Linux shell commands and gawk.
The example structure of the file is like this:
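For illustration, assume a row id in the first column, the grouping Id in the second, and the Sequence in the third (the values below are made up, only the shape matters):

```
RowId,Id,Sequence
1,1,1
2,2,1
3,1,2
4,3,1
5,1,3
6,4,1
7,3,2
8,2,2
```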
There is a requirement: records with the same value in the Id column must be processed together. Basically, that means they cannot end up in different chunks, even if that makes a chunk slightly exceed its limit, say, 1003 records instead of 1000.
The first step is to sort the file by the Id column. It will be even easier for my loader if the file is also sorted by the Sequence column as a secondary key. To simplify things, let's get rid of the header too.
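Something like this does it with gawk (the file names are placeholders):

```
gawk 'BEGIN { getline } { print $0 }' data.csv > no_header.csv
```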
A little explanation: AWK's BEGIN block is executed only once, before the first line is processed. Here getline simply reads the header line without printing it, so the real output produced by print $0 actually starts from the second line. Now sort:
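Assuming the Id is the second column and the Sequence the third, the command looks roughly like this:

```
sort --field-separator=, -k2n -k3n no_header.csv > sorted.csv
```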
-k2 means sorting by column number 2 (column numbering starts from 1), and n means a numeric sort; the same applies to the secondary key. In MySQL the same result would be achieved with ORDER BY Id ASC, Sequence ASC. --field-separator is necessary because the default separator is whitespace, while in our file it is a comma. Now we have this:
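With the illustrative data from above, sorted.csv contains:

```
1,1,1
3,1,2
5,1,3
2,2,1
8,2,2
4,3,1
7,3,2
6,4,1
```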
Let’s assume we want to split the file into chunks of 2 records. The natural approach with split doesn’t work, because it can tear a group of records with the same Id apart. Indeed, if we do something like this:
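For example, with GNU split and an arbitrary chunk_ prefix for the output pieces:

```
split -l 2 sorted.csv chunk_
```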
The third record with Id = 1 ended up in the second chunk, but it should be in the first one. To solve the problem I played around with AWK a little and wrote the following script:
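The idea is to count how many records have gone into the current chunk and to start a new chunk only when the limit is reached and the Id (the second field) changes. A sketch of split.awk along these lines, with the chunk size of 2 hardcoded and chunk_N.csv as an assumed naming scheme for the output files:

```
# split.awk: write records into chunk files of roughly "limit" records,
# never separating records that share the same Id (second field).
BEGIN {
    limit = 2      # target chunk size; for the real file it would be 1000
    chunk = 1      # number of the chunk currently being written
    count = 0      # records written to the current chunk so far
}
{
    # Open a new chunk only when the current one is full AND the Id changes,
    # so records with the same Id always stay together.
    if (count >= limit && $2 != prev_id) {
        close("chunk_" chunk ".csv")
        chunk++
        count = 0
    }
    print $0 > ("chunk_" chunk ".csv")
    count++
    prev_id = $2
}
```

Instead of hardcoding the limit in BEGIN, it could also be passed in with gawk's -v option, e.g. -v limit=1000.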
Run the script:
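Here sorted.csv is the file produced by the sort step above:

```
gawk -F, -f split.awk sorted.csv
```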
Pay attention to the -F, argument, which tells gawk to use , as the field separator; -f split.awk means loading the script from a file. For our input file it creates 4 chunks with the following content:
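With the illustrative data, the chunks look like this:

```
chunk_1.csv:
1,1,1
3,1,2
5,1,3

chunk_2.csv:
2,2,1
8,2,2

chunk_3.csv:
4,3,1
7,3,2

chunk_4.csv:
6,4,1
```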
Exactly what I need: the records with row ids 1 and 3, which share Id = 1, are in the same chunk, even though its size is larger than the limit of 2 records.