Friday, May 15, 2009

Syntax for parsing tuples into C arrays

Parsing tuples into C arrays is quite painful with current C API. Below is is an example when you don't know the tuple size before hand:

....
//ref: http://docs.python.org/c-api/arg.html
PyObject* pyObj;
if(!PyArg_ParseTuple(args,"O",
&pyObj)
|| !pyTuple_Check(pyObj))
return NULL;

array = (double*) malloc(sizeof(double)*tupleSize);
for(i=0; i < tupleSize; i++ ) {
tupleItem = PyTuple_GetItem(pyObj, i);
if( PyFloat_Check(tupleItem) ) {
array[i] = PyFloat_AsDouble(tupleItem);
}else{
printf("Error: tuple contains a non-float value");
exit(1);
}
}
....

I'd like to propose a syntax to facilitate that kind of tasks. There is already syntax for parsing strings:

{
const char * s; int i;
PyArg_ParseTyple(args, "s#", &s, &i);
}

Consider the string as a sequence of char's. Then note that in python, a tuple of int's, is also a sequence of int's. This perspective provides a natural way to extend the the use of "#" symbol. First, to denote a tuple as a sequence of a given type of objects, let's use "(*type*:)". For example, to denote a tuple as a sequence of integers: "(i:)". The ":" is chosen to denote possible repetition of the whole pattern inside the parentheses. As for why choose ":", here is a bad story, but a good example:

Once upone a time there was a temple on a hill, and there lived an old and a young monk. One day the old monk told the the young a story:
Once upone a time there was a temple on a hill, and there lived an old and a young monk. One day the old monk told the the young a story:
...

OK, enough...but you got the idea :). Now consider 3 different scenarios of parsing tuples into arrays.

(1) Unkown size. Here are some examples:

{
//parsing a tuple and its size
int i, *t;
PyArg_ParseTuple(args, "(i:)#", &t, &i);
}

{
//parsing a matrix of i rows and j columns:
int i, j, **t;
PyArg_ParseTuple(args, "((i:):)##", &t, &i, &j);
}

Here is an informal explanation of "((i:):)##": the '#' can be seen as a peeling operator, which peels away the outer most ':'-parentheses pair "(.:)" that denote a "free sequence" (i.e. sequence of unknown size) with some pattern ".".

For types like 's' or its similar types, pretend as if there is an invisible pair of parentheses around it. For example, "(s:)##" matches a tuple of strings of some fixed length. Here the first '#' peels away the parentheses, leaving only "s#", which is equivalent to "(*char*:)#" in concept, where "*char*" denote the character type. As the second '#' is outside the tuple, it is shared by all strings inside, so they must have the same length.

If there is nothing to peel for a '#', it is considered a dangling '#', an error. Now consider some more interesting examples:

{
int i, *t, *v;
PyArg_ParseTuple(args, "(ii:)#", &t, &v, &i);
//args=(1,2,3,4,5,6)
//t={1,3,5}
//v={2,4,6}
//i=3
}

In that example, a pattern is provided inside the sequence "(ii:)", which can repeat itself multiple times. The repeated times is extracted by the '#' symbol. Below is another example:

{
int i,j, *t, **v;
PyArg_ParseTuple(args, "(i(i:):)##", &t, &v, &i, &j);
//args=(1,(3,),5,(7,))
//t={1,5}
//v={{3},{7}}
//i=2, j=1
}

And another one:

{
int i,j, **t, **v;
PyArg_ParseTuple(args, "((i:)(i:):)##", &t, &v, &i, &j);
//args=((1,),(3,),(5,),(7,))
//t={{1},{5}}
//v={{3},{7}}
//i=2, j=1
}

Note the second '#' is applied to both the free sequences inside the outer tuple. This indicates that all the inside tuples must be of the same length. What if they have their own common lengths? We provide a separate matching '#' for each inside tuple:

{
int i,j,k, **t, **v;
PyArg_ParseTuple(args, "((i:)(i:):)#(##)", &t, &v, &i, &j,&k);
//args=((1,),(3,4),(5,),(7,8))
//t={{1},{5}}
//v={{3,4},{7,8}}
//i=2, j=1, k=2
}

The syntax can also deal with more complicated situations, such as an irregular 2-dimensional array, where each row has a different row length:

{
int i, *j, **t;
PyArg_ParseTuple(args, "((i:)#:)#", &t, &j, &i);
for(a=0; a)
//t={{{1,2},{1,2}},{{1,2}}}
//i=2, k=2
//j={2,1}
}

This syntax can be used for parsing a tuple of strings:

{
int i, *j;
const char ** s;
PyArg_ParseTuple(args, "(s#:)#", &s, &j, &i);
//args=("ab", "cde")
//s={"ab", "cde"}
//j={2, 3}
//i=2
}

(2) If you know the size of a tuple before hand, use this syntax: "(*type*integer)", meaning there are exactly 'integer' number of *type* objects in the tuple.

{
int *t;
//
PyArg_ParseTuple(args, "(i5)", &t);
//args = (1,2,3,4,5)
//t = {1,2,3,4,5}
}

A complicated case:
{
int **t;
PyArg_ParseTuple(args, "((i5)2)", &t);
//args=((1,2,3,4,5),(5,6,7,8,9))
//t={{1,2,3,4,5},{5,6,7,8,9}}
}

Another complicated case:

{
int **t, **v, i;
PyArg_ParseTuple(args, "(i3i2:)#", &t,&v,&i);
//args=(1,2,3,4,5,6,7,8,9,10)
//t={{1,2,3},{6,7,8}}
//v={{4,5},{9,10}}
//i=2
}

One more thing: the difference between "(ii)" and "(i2)" is that the former yields two single integers, while the latter gives an array of length two.

(3) Concatenating arrays while parsing. Sometimes you would like to have a matrix parsed into a single array. Let's use a "(+" to start a tuple, such that the arrays parsed from the elements in that tuple should be concatenated into a single array.

{
int i,j, *t;
PyArg_ParseTuple(args, "(+(i))##", &t, &i, &j);
//args=((1,2,3),(4,5,6))
//t={1,2,3,4,5,6}
//i=2, j=3
}

Another example:

{
int i, *t;
PyArg_ParseTuple(args, "(+(i2))#", &t, &i);
}

Still another example, with strings:
{
int i, *j;
const char * t; //string
PyArg_ParseTuple(args, "(+(+s#2))#", &t, &j, &i);
//args=(("ab","c"),("d","efg"))
//t="abcdefg"
//i=2
////notice the size arrays are contatenated similarly
//j={2,1,1,3}

const char ** sa;
PyArg_ParseTuple(args, "(+(s2))", &sa);
//sa = {"ab","c","d","efg"}

PyArg_ParseTuple(args, "((+s2))", &sa);
//sa = { "abc", "defg" }
}

And finally, an example with multi-item pattern:

{
int *t, *v, i;
PyArg_ParseTuple(args, "(+i1i2:)#", &t, &v, &i);
//args=(1,2,3,4,5,6)
//t={1,4}
//v={2,3,5,6}
//i=2
}

Epilogue:

This syntax can be applied to lists: just use "[.:]" instead of "(.:)". It can also be used for building tuples or lists out of arrays.

That's it. Comments?