Read fixed width text file?

Question

Can anyone tell me how I can make GAUSS read a fixed width text file, that looks somewhat like the example below?

I have many such files with around 50,000 lines. The first three variables are strings, and then 8 integers and finally a couple of strings again.

F773300 0100 960000 2179 F
F773300 3110 09130 760466 0 0 0 185085 945551
H030000 0100 030000 10114 F
H030000 3110 01130 10114 0 0 0 0 10114 FFFF
H107120 0100 100040 20645 F
H107120 3110 01110 20645 0 0 0 0 20645 FFFF
H140000 0100 140000 8258 F
H140000 3110 03113 8258 0 0 0 0 8258 FFFF
H160000 0100 160000 33032 F

Best regards

3 Answers

Your Answer

aptech · Answer 1

The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).

Save the following contents to a file named 'fixedwidth.csv':

 F773300     0100      960000         2179                                                                  F 
F773300     3110      09130          760466            0    0   0       185085             945551 
H030000     0100      030000         10114                                                                 F 
H030000     3110      01130          10114             0    0   0   0                      10114           FFFF 
H107120     0100      100040         20645                                                                 F 
H107120     3110      01110          20645             0    0   0   0                      20645           FFFF 
H140000     0100      140000         8258                                                                  F 
H140000     3110      03113          8258              0    0   0   0                      8258            FFFF 
H160000     0100      160000         33032                                                                 F

and run the following example

 // Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");
 
// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
        META_TYPE_CATEGORY|
        META_TYPE_STRING|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_CATEGORY;
 
// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;
 
// Example with the given data above
df = offset_strs_to_df(data, types, fields);
 
proc offset_strs_to_df(data, types, fields);
    local df;
 
    // When GAUSS 23.0.1 is released, this won't be necessary.
    data = asdf(data);
 
    // Create output symbol to write to.
    df = asdf(zeros(rows(data), rows(types)));
 
    // Add -1 to the end of the field offsets so the last one is considered "to end of line"
    fields = fields|-1;
 
    for i(1, rows(fields) - 1, 1);
        // Grab value from each row.
        df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));
 
        // Change empty values into missings, and convert to a numeric vector.
        if types[i] == META_TYPE_NUMBER;
            df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
        endif;
 
        // We could assign the column directly (as shown below), or concat, but this more performant.
        df = dftype(move(df), types[i], i);
    endfor;
 
    retp(df);
endp;

link

aptech

1,773

aptech · Answer 2

It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?

Also, do you know the offsets to the start of each column?

link

aptech

1,773

Peter_dk · Answer 3

Yes, not every record has observations for all variables, so they are represented by spaces

No, the file does not have a header row

Yes, I know the offset to the start of every variable

link

Peter_dk

4

aptech · Answer 4

The following solution should work. There are a couple places that will be simplified with the next maintenance release (such as allowing string arrays to work with strsect instead of just dataframes).

Save the following contents to a file named 'fixedwidth.csv':

 F773300     0100      960000         2179                                                                  F 
F773300     3110      09130          760466            0    0   0       185085             945551 
H030000     0100      030000         10114                                                                 F 
H030000     3110      01130          10114             0    0   0   0                      10114           FFFF 
H107120     0100      100040         20645                                                                 F 
H107120     3110      01110          20645             0    0   0   0                      20645           FFFF 
H140000     0100      140000         8258                                                                  F 
H140000     3110      03113          8258              0    0   0   0                      8258            FFFF 
H160000     0100      160000         33032                                                                 F

and run the following example

 // Pick an arbitrary character for a delimiter that won't be present in the data.
data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");
 
// Define types we'd like fields to have
types = META_TYPE_CATEGORY|
        META_TYPE_CATEGORY|
        META_TYPE_STRING|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_NUMBER|
        META_TYPE_CATEGORY;
 
// Provide the start position of each field. We will add -1 later for "until end of line"
fields = 1|13|23|38|56|61|65|69|73|92|108;
 
// Example with the given data above
df = offset_strs_to_df(data, types, fields);
 
proc offset_strs_to_df(data, types, fields);
    local df;
 
    // When GAUSS 23.0.1 is released, this won't be necessary.
    data = asdf(data);
 
    // Create output symbol to write to.
    df = asdf(zeros(rows(data), rows(types)));
 
    // Add -1 to the end of the field offsets so the last one is considered "to end of line"
    fields = fields|-1;
 
    for i(1, rows(fields) - 1, 1);
        // Grab value from each row.
        df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));
 
        // Change empty values into missings, and convert to a numeric vector.
        if types[i] == META_TYPE_NUMBER;
            df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
        endif;
 
        // We could assign the column directly (as shown below), or concat, but this more performant.
        df = dftype(move(df), types[i], i);
    endfor;
 
    retp(df);
endp;

link

aptech

1,773

aptech · Answer 5

It looks like the file contains some missing data and it is represented by spaces and the file does not have a header row. Is this correct?

Also, do you know the offsets to the start of each column?

link

aptech

1,773

Peter_dk · Answer 6

Yes, not every record has observations for all variables, so they are represented by spaces

No, the file does not have a header row

Yes, I know the offset to the start of every variable

link

Peter_dk

4

Read fixed width text file?

3 Answers

Your Answer

3 Answers

You must login to post answers.

Have a Specific Question?

Need Support?

	F773300 0100 960000 2179 F
	F773300 3110 09130 760466 0 0 0 185085 945551
	H030000 0100 030000 10114 F
	H030000 3110 01130 10114 0 0 0 0 10114 FFFF
	H107120 0100 100040 20645 F
	H107120 3110 01110 20645 0 0 0 0 20645 FFFF
	H140000 0100 140000 8258 F
	H140000 3110 03113 8258 0 0 0 0 8258 FFFF
	H160000 0100 160000 33032 F

	// Pick an arbitrary character for a delimiter that won't be present in the data.
	data = csvReadSA(__FILE_DIR$+"fixedwidth.csv", 1, 1, "^");

	// Define types we'd like fields to have
	types = META_TYPE_CATEGORY\|
	META_TYPE_CATEGORY\|
	META_TYPE_STRING\|
	META_TYPE_NUMBER\|
	META_TYPE_NUMBER\|
	META_TYPE_NUMBER\|
	META_TYPE_NUMBER\|
	META_TYPE_NUMBER\|
	META_TYPE_NUMBER\|
	META_TYPE_NUMBER\|
	META_TYPE_CATEGORY;

	// Provide the start position of each field. We will add -1 later for "until end of line"
	fields = 1\|13\|23\|38\|56\|61\|65\|69\|73\|92\|108;

	// Example with the given data above
	df = offset_strs_to_df(data, types, fields);

	proc offset_strs_to_df(data, types, fields);
	local df;

	// When GAUSS 23.0.1 is released, this won't be necessary.
	data = asdf(data);

	// Create output symbol to write to.
	df = asdf(zeros(rows(data), rows(types)));

	// Add -1 to the end of the field offsets so the last one is considered "to end of line"
	fields = fields\|-1;

	for i(1, rows(fields) - 1, 1);
	// Grab value from each row.
	df[.,i] = strtrim(strsect(data, fields[i], fields[i+1] - fields[i]));

	// Change empty values into missings, and convert to a numeric vector.
	if types[i] == META_TYPE_NUMBER;
	df[.,i] = strtof(reclassify(ntos(df[.,i]), "", "."));
	endif;

	// We could assign the column directly (as shown below), or concat, but this more performant.
	df = dftype(move(df), types[i], i);
	endfor;

	retp(df);
	endp;