A broken 90-thousand-line CSV file. UTF-8.

Cyrillic and hanzi characters/logograms showing...

... Latin-1 diacritic characters broken.

e.g. FlÃ³rez instead of Flórez
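
This kind of breakage is what you get when UTF-8 bytes are read as Windows-1252 and then saved again as UTF-8. A minimal sketch of that round trip, assuming an iconv that knows the WINDOWS-1252 name; the function names break_text and fix_text are made up for illustration and are not part of the cleanup described in this post:

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Misread UTF-8 bytes as Windows-1252 and re-encode them as UTF-8.
break_text() {
    iconv -f WINDOWS-1252 -t UTF-8
}

# Undo the damage by converting back the other way.
fix_text() {
    iconv -f UTF-8 -t WINDOWS-1252
}

# \303\263 are the two UTF-8 bytes of "ó", written as octal escapes.
printf 'Fl\303\263rez\n' | break_text             # prints FlÃ³rez
printf 'Fl\303\263rez\n' | break_text | fix_text  # prints Flórez
```

The reverse conversion may fail on characters such as Á or Í, whose second UTF-8 byte (0x81, 0x8D) has no Windows-1252 equivalent, which is one reason a per-character substitution list can still be needed.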

Attached is a zip file containing two text files.

I’ve removed most character-type garbage and (non-visible) termination characters. 

I wouldn’t use it to import into a database, but this should be adequate for your stated requirements.

Should you wish any further alterations please don’t hesitate to ask.

Final command used to clean, sort, remove duplicates, reshuffle and output to a separate file:

⇒  gsed -e 's/[0-9]//g' -e 's/,/ /g' -e 's/Ã©/é/g' -e 's/"//g' -e '/^$/d' -e 's/Â´/´/g' -e 's/Ãº/ú/g' -e 's/Ã±/ñ/g' -e 's/Ã³/ó/g' -e 's/Ã¡/á/g' -e 's/Ã¹/ù/g' -e 's/à /Á/g' -e 's/Ã‘/Ñ/g' -e 's/  / /g' -e 's/[[:space:]]$//' -e 's/Ã²/ò/g' -e 's/Ã­/í/g' -e 's/Ã¬/ì/g' -e 's/Ã¼/ü/g' -e 's/Ã€/À/g' -e 's/Ã§/ç/g' -e 's/Ã¨/è/g' -e 's/Ã‰/É/g' -e 's/Ã“/Ó/g' -e 's/Ãª/ê/g' -e 's/Ã¤/ä/g' -e 's/ÃŠ/Ê/g' -e 's/Ã¶/ö/g' -e 's/Ãš/Ú/g' -e 's/Ã^Í/Í/g' -e 's/Ã^Á/Á/g' -e 's/ÃŒ/Ì/g' -e 's/Ã®/î/g' -e 's/Ã¯/ï/g' -e 's/Ã£/ã/g' -e 's/Ã„/Ä/g' -e 's/Ã/à/g' 90K_list_of_names.csv | sort | uniq | gshuf > 80K_clean_list.txt
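
To see what the non-encoding parts of that pipeline do, here is a cut-down sketch run on a few made-up rows. The sample names, the clean_names function, and the SED/SHUF variables are assumptions for illustration; it also trims leading whitespace and collapses longer runs of spaces, which the command above does not do, and it leaves out the character substitutions entirely:

```shell
#!/bin/sh
# Prefer the g-prefixed GNU tools (e.g. from Homebrew), else the plain names.
SED=$(command -v gsed || command -v sed)
SHUF=$(command -v gshuf || command -v shuf)

# Hypothetical helper: strip digits, commas and quotes, tidy whitespace,
# drop empty lines, then sort, de-duplicate and shuffle.
clean_names() {
    "$SED" -e 's/[0-9]//g' \
           -e 's/,/ /g' \
           -e 's/"//g' \
           -e 's/  */ /g' \
           -e 's/^[[:space:]]*//' \
           -e 's/[[:space:]]*$//' \
           -e '/^$/d' \
    | sort | uniq | "$SHUF"
}

# Four input rows (one empty, one duplicate name) become two names,
# printed in random order.
printf '%s\n' '12,"Ana Ruiz"' '7,"Ana Ruiz"' '' '3,"Luis Vega" ' | clean_names
```

Putting the empty-line delete last means rows that only become empty after the digits and punctuation are stripped are dropped too.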

Followed by a command to remove the newline (line feed) characters and replace them with two spaces for design use:

⇒  gsed ':a;N;$!ba;s/\n/  /g' 80K_clean_list.txt > two_space_separated.txt
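
The sed loop above reads the whole file into the pattern space before substituting. As a sketch of an alternative (not the command used here), awk can do the same join line by line; join_lines is a made-up name and the sample names are invented:

```shell
#!/bin/sh
# Hypothetical helper: join all input lines with two spaces between them,
# ending the result with a single newline.
join_lines() {
    awk 'NR > 1 { printf "  " } { printf "%s", $0 } END { print "" }'
}

printf 'Ana Ruiz\nLuis Vega\nEva Sol\n' | join_lines
# prints: Ana Ruiz  Luis Vega  Eva Sol
```

Unlike the sed version, this does not leave a trailing separator problem to think about and does not depend on GNU-specific label syntax.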

Show a file with line breaks, tabs and non-printing characters:

gcat -A thing.txt
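
A small sketch of what that output looks like, using made-up input. It uses cat -vet, which GNU cat treats the same as -A and which BSD cat also accepts; show_invisibles is a hypothetical wrapper name:

```shell
#!/bin/sh
# Hypothetical helper: show tabs as ^I, carriage returns as ^M,
# and mark each line end with $.
show_invisibles() {
    cat -vet "$@"
}

# A tab, a trailing space and a Windows-style CRLF line ending.
printf 'Ana\tRuiz \r\n' | show_invisibles
# prints: Ana^IRuiz ^M$
```

A stray ^M before the $ is the usual sign of CRLF line endings, and a space before the $ is trailing whitespace.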

Great reference: UTF-8 Encoding Debugging Chart



Published

30 May 2014

Category

hacking
