Can anyone tell me how I can make GAUSS read a fixed width text file, that looks somewhat like the example below?
I have many such files with around 50,000 lines. The first three variables are strings, and then 8 integers and finally a couple of strings again.
F773300 0100 960000 2179 F
F773300 3110 09130 760466 0 0 0 185085 945551
H030000 0100 030000 10114 F
H030000 3110 01130 10114 0 0 0 0 10114 FFFF
H107120 0100 100040 20645 F
H107120 3110 01110 20645 0 0 0 0 20645 FFFF
H140000 0100 140000 8258 F
H140000 3110 03113 8258 0 0 0 0 8258 FFFF
H160000 0100 160000 33032 F
Best regards
3 Answers
1
accepted
The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).
Save the following contents to a file named 'fixedwidth.csv':
F773300 0100 960000 2179 F
F773300 3110 09130 760466 0 0 0 185085 945551
H030000 0100 030000 10114 F
H030000 3110 01130 10114 0 0 0 0 10114 FFFF
H107120 0100 100040 20645 F
H107120 3110 01110 20645 0 0 0 0 20645 FFFF
H140000 0100 140000 8258 F
H140000 3110 03113 8258 0 0 0 0 8258 FFFF
H160000 0100 160000 33032 F
and run the following example
// Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");
// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
META_TYPE_CATEGORY|
META_TYPE_STRING|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_CATEGORY;
// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;
// Example with the given data above
df = offset_strs_to_df(data, types, fields);
proc offset_strs_to_df(data, types, fields);
local df;
// When GAUSS 23.0.1 is released, this won't be necessary.
data = asdf(data);
// Create output symbol to write to.
df = asdf(zeros(rows(data), rows(types)));
// Add -1 to the end of the field offsets so the last one is considered "to end of line"
fields = fields|-1;
for i(1, rows(fields) - 1, 1);
// Grab value from each row.
df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));
// Change empty values into missings, and convert to a numeric vector.
if types[i] == META_TYPE_NUMBER;
df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
endif;
// We could assign the column directly (as shown below), or concat, but this more performant.
df = dftype(move(df), types[i], i);
endfor;
retp(df);
endp;
0
It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?
Also, do you know the offsets to the start of each column?
0
Yes, not every record has observations for all variables, so they are represented by spaces
No, the file does not have a header row
Yes, I know the offset to the start of every variable
Your Answer
3 Answers
The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).
Save the following contents to a file named 'fixedwidth.csv':
F773300 0100 960000 2179 F
F773300 3110 09130 760466 0 0 0 185085 945551
H030000 0100 030000 10114 F
H030000 3110 01130 10114 0 0 0 0 10114 FFFF
H107120 0100 100040 20645 F
H107120 3110 01110 20645 0 0 0 0 20645 FFFF
H140000 0100 140000 8258 F
H140000 3110 03113 8258 0 0 0 0 8258 FFFF
H160000 0100 160000 33032 F
and run the following example
// Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");
// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
META_TYPE_CATEGORY|
META_TYPE_STRING|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_CATEGORY;
// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;
// Example with the given data above
df = offset_strs_to_df(data, types, fields);
proc offset_strs_to_df(data, types, fields);
local df;
// When GAUSS 23.0.1 is released, this won't be necessary.
data = asdf(data);
// Create output symbol to write to.
df = asdf(zeros(rows(data), rows(types)));
// Add -1 to the end of the field offsets so the last one is considered "to end of line"
fields = fields|-1;
for i(1, rows(fields) - 1, 1);
// Grab value from each row.
df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));
// Change empty values into missings, and convert to a numeric vector.
if types[i] == META_TYPE_NUMBER;
df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
endif;
// We could assign the column directly (as shown below), or concat, but this more performant.
df = dftype(move(df), types[i], i);
endfor;
retp(df);
endp;
It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?
Also, do you know the offsets to the start of each column?
Yes, not every record has observations for all variables, so they are represented by spaces
No, the file does not have a header row
Yes, I know the offset to the start of every variable