Aptech Store

Read fixed width text file?

Can anyone tell me how I can make GAUSS read a fixed width text file, that looks somewhat like the example below?

I have many such files with around 50,000 lines. The first three variables are strings, and then 8 integers and finally a couple of strings again.

F773300       0100      960000                  2179                                                                                          F
F773300        3110      09130                760466            0    0    0    185085                     945551
H030000     0100      030000                10114                                                                                          F
H030000      3110      01130                    10114             0    0   0    0                                   10114           FFFF
H107120       0100      100040               20645                                                                                           F
H107120        3110      01110                   20645             0    0    0    0                                 20645           FFFF
H140000      0100      140000                 8258                                                                                           F
H140000       3110      03113                     8258             0    0    0    0                                   8258           FFFF
H160000      0100      160000                33032                                                                                          F

Best regards

 

3 Answers



1



accepted

The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).

Save the following contents to a file named 'fixedwidth.csv':

F773300 0100 960000 2179 F
F773300 3110 09130 760466 0 0 0 185085 945551
H030000 0100 030000 10114 F
H030000 3110 01130 10114 0 0 0 0 10114 FFFF
H107120 0100 100040 20645 F
H107120 3110 01110 20645 0 0 0 0 20645 FFFF
H140000 0100 140000 8258 F
H140000 3110 03113 8258 0 0 0 0 8258 FFFF
H160000 0100 160000 33032 F

and run the following example

// Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");
// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
META_TYPE_CATEGORY|
META_TYPE_STRING|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_CATEGORY;
// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;
// Example with the given data above
df = offset_strs_to_df(data, types, fields);
proc offset_strs_to_df(data, types, fields);
local df;
// When GAUSS 23.0.1 is released, this won't be necessary.
data = asdf(data);
// Create output symbol to write to.
df = asdf(zeros(rows(data), rows(types)));
// Add -1 to the end of the field offsets so the last one is considered "to end of line"
fields = fields|-1;
for i(1, rows(fields) - 1, 1);
// Grab value from each row.
df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));
// Change empty values into missings, and convert to a numeric vector.
if types[i] == META_TYPE_NUMBER;
df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
endif;
// We could assign the column directly (as shown below), or concat, but this more performant.
df = dftype(move(df), types[i], i);
endfor;
retp(df);
endp;

aptech

1,773


0



It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?

Also, do you know the offsets to the start of each column?

aptech

1,773


0



Yes, not every record has observations for all variables, so they are represented by spaces

No, the file does not have a header row

Yes, I know the offset to the start of every variable

 

Your Answer

3 Answers

1
accepted

The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).

Save the following contents to a file named 'fixedwidth.csv':

F773300 0100 960000 2179 F
F773300 3110 09130 760466 0 0 0 185085 945551
H030000 0100 030000 10114 F
H030000 3110 01130 10114 0 0 0 0 10114 FFFF
H107120 0100 100040 20645 F
H107120 3110 01110 20645 0 0 0 0 20645 FFFF
H140000 0100 140000 8258 F
H140000 3110 03113 8258 0 0 0 0 8258 FFFF
H160000 0100 160000 33032 F

and run the following example

// Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");
// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
META_TYPE_CATEGORY|
META_TYPE_STRING|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_NUMBER|
META_TYPE_CATEGORY;
// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;
// Example with the given data above
df = offset_strs_to_df(data, types, fields);
proc offset_strs_to_df(data, types, fields);
local df;
// When GAUSS 23.0.1 is released, this won't be necessary.
data = asdf(data);
// Create output symbol to write to.
df = asdf(zeros(rows(data), rows(types)));
// Add -1 to the end of the field offsets so the last one is considered "to end of line"
fields = fields|-1;
for i(1, rows(fields) - 1, 1);
// Grab value from each row.
df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));
// Change empty values into missings, and convert to a numeric vector.
if types[i] == META_TYPE_NUMBER;
df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
endif;
// We could assign the column directly (as shown below), or concat, but this more performant.
df = dftype(move(df), types[i], i);
endfor;
retp(df);
endp;

0

It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?

Also, do you know the offsets to the start of each column?

0

Yes, not every record has observations for all variables, so they are represented by spaces

No, the file does not have a header row

Yes, I know the offset to the start of every variable

 


You must login to post answers.

Have a Specific Question?

Get a real answer from a real person

Need Support?

Get help from our friendly experts.