Task Spec

From RL-Glue

Revision as of 01:20, 26 September 2011 by Btanner (Talk | contribs)

Note: This page needs cleanup. It was converted directly from the old Google Sites page in a hurry, so cleanup will happen step by step.

Overview

The task specification language (task spec) is explained in the RL-Glue overview documentation. The main idea is that the task specification is a string that is generated by the environment and provided to the agent in agent_init so that the agent knows what to expect, in terms of the type and dimensionality of observations, actions, and the range of rewards.

As RL-Glue has evolved, the task specification language has changed as well. It has been quite a challenge to keep RL-Glue, the task spec version, and all of the task spec parsers for different languages in lock-step. This is a problem, especially since, from the RL-Glue 3.0 release onward, we hope that RL-Glue will be very stable and will require only infrequent updates.

To solve this problem, we've decided to move the task specification details out of the RL-Glue manual and to put them on this website. As new versions are standardized, they will appear here. It's usually ok to use old versions.

Task Specifications

Task Spec 3.0 (Fall 2008)

Final Specification

  • ALLCAPS means text as written
  • <lowercase> means a variable
  • [variable=x] means optional variable, and x will be the value if you leave it out.
  • * means that you can have 0 or more of these.

Task Spec RL-Glue 3.0 Template:

VERSION <version-name> PROBLEMTYPE <problem-type> DISCOUNTFACTOR <discount-factor> OBSERVATIONS INTS ([times-to-repeat-this-tuple=1] <min-value> <max-value>)* DOUBLES ([times-to-repeat-this-tuple=1] <min-value> <max-value>)* CHARCOUNT <char-count> ACTIONS INTS ([times-to-repeat-this-tuple=1] <min-value> <max-value>)* DOUBLES ([times-to-repeat-this-tuple=1] <min-value> <max-value>)* CHARCOUNT <char-count> REWARDS (<min-value> <max-value>) EXTRA [extra text of your choice goes here]

The INTS, DOUBLES, and CHARCOUNT keywords are needed only when the section has int ranges, double ranges, or a char-count greater than 0, respectively.

  • version-name: If using this task spec, should be RL-Glue-3.0. If you are making your own custom task spec, name it whatever you like. Letters, numbers, dashes, no spaces please.
  • problem-type: episodic | continuing | something-else.
  • discount-factor: A real number in [0,1].
  • times-to-repeat-this-tuple: The number of times to repeat this tuple. For example, if you had a length 1024 bitmap, you could do (1024 0 1), instead of (0 1) (0 1) (0 1), etc.
  • min-value: Depending on whether this is an INT or DOUBLE, this can be an integer or a double. Special values allowed are NEGINF or UNSPEC.
  • max-value: Depending on whether this is an INT or DOUBLE, this can be an integer or a double. Special values allowed are POSINF or UNSPEC.
  • char-count: Size of character array to expect.
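
The tuple rules above (optional repeat count, then min and max, with NEGINF, POSINF, and UNSPEC as special values) can be sketched in Java roughly as follows. This is an illustration only, not part of any RL-Glue codec: the class name RangeTuple, its fields, and the choice to represent UNSPEC as NaN are all assumptions made for this example.

```java
/** Hypothetical helper: one parsed "(repeat min max)" or "(min max)" group. */
final class RangeTuple {
    final int repeat;   // how many variables share this range (defaults to 1)
    final double min;   // NEGINF maps to -Infinity; UNSPEC maps to NaN (an assumption)
    final double max;   // POSINF maps to +Infinity; UNSPEC maps to NaN (an assumption)

    RangeTuple(int repeat, double min, double max) {
        this.repeat = repeat;
        this.min = min;
        this.max = max;
    }

    /** Parses the text between the parentheses, e.g. "1024 0 1" or "NEGINF POSINF". */
    static RangeTuple parse(String inside) {
        String[] parts = inside.trim().split("\\s+");
        if (parts.length == 2) {
            // No repeat count given, so the tuple describes a single variable.
            return new RangeTuple(1, bound(parts[0]), bound(parts[1]));
        }
        if (parts.length == 3) {
            return new RangeTuple(Integer.parseInt(parts[0]), bound(parts[1]), bound(parts[2]));
        }
        throw new IllegalArgumentException("Expected 2 or 3 values, got: " + inside);
    }

    /** Converts one bound, honouring the special tokens from the field list. */
    static double bound(String token) {
        switch (token) {
            case "NEGINF": return Double.NEGATIVE_INFINITY;
            case "POSINF": return Double.POSITIVE_INFINITY;
            case "UNSPEC": return Double.NaN;
            default:       return Double.parseDouble(token);
        }
    }
}
```

With this sketch, RangeTuple.parse("1024 0 1") describes 1024 variables in the range 0 to 1, while RangeTuple.parse("0 1") describes a single one.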


Notes:

  • The character arrays don't have ranges, they are just characters. In theory this means they could be any ASCII value. In practice, I'm sure this will lead to encoding issues if it becomes a popular feature.
  • This new task spec is more verbose and a bit longer in some cases, but we think it's worth it to make it easier to parse and easier to read.
  • Observations and actions of the same type (int/double/char) have to be grouped together in the task spec. You cannot specify some int variables, then some doubles, then some chars, then more ints.
  • You specify each int or double as a range, and a range can count for multiple observations or actions (see above).
  • The extra field is meant to support second-level meta data and other more complex protocols (task spec inside a task spec).


Examples

VERSION RL-Glue-3.0 PROBLEMTYPE episodic DISCOUNTFACTOR 1 OBSERVATIONS INTS (3 0 1) DOUBLES (2 -1.2 0.5) (-.07 .07) CHARCOUNT 1024 ACTIONS INTS (0 4) REWARDS (-5.0 5.0) EXTRA some other stuff goes here

This would create a task spec with:

  • 3-dimensional integer observations, all {0,1}
  • 3-dimensional continuous observations. The first 2 are in [-1.2, .5], the third is in [-.07,.07]
  • 1024 character observations
  • 1-dimensional integer action, {0,1,2,3,4}
  • Min Reward -5.0, Max Reward 5.0
  • Extra: "some other stuff goes here"

VERSION RL-Glue-3.0 PROBLEMTYPE episodic DISCOUNTFACTOR 1 OBSERVATIONS INTS (UNSPEC 1) ACTIONS DOUBLES (NEGINF POSINF) CHARCOUNT 0 REWARDS (UNSPEC UNSPEC) EXTRA Name: Test Problem A

This would create a task spec with:

  • 1-dimensional integer observations, minimum value unspecified, maximum value 1.
  • 1-dimensional continuous action, valid values range from negative infinity to positive infinity
  • Reward range not specified
  • Extra: "Name: Test Problem A"

VERSION RL-Glue-3.0 PROBLEMTYPE episodic DISCOUNTFACTOR 1 OBSERVATIONS DOUBLES (-1.2 0.5) (-.07 .07) ACTIONS INTS (0 2) REWARDS (-1 0) EXTRA Name=Traditional-Mountain-Car Cutoff=None Random-Starts=True

This would create a task spec for Mountain Car:

  • 2-dimensional continuous observations, first in [-1.2, 0.5] (the position) and second in [-.07, .07] (the velocity)
  • 1-dimensional integer action, valid values are {0,1,2}
  • Rewards are all in [-1,0]
  • Extra text could be parsed (if meta data protocol was shared between environment and experiment or agent) to reveal the problem name and some parameter settings.
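
The example specs above can be read in one left-to-right pass. The following is a minimal Java sketch of such a pass, not the official RL-Glue parser: the class TaskSpec3Scanner, its scan method, and the map keys such as OBSERVATIONS.INTS are hypothetical names chosen for illustration. It records the header fields, counts the dimensions declared in each section (honouring the optional repeat count), and keeps the EXTRA text verbatim. It assumes a well-formed spec and does little error handling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical single-pass reader for RL-Glue-3.0 task specs. */
final class TaskSpec3Scanner {

    static Map<String, String> scan(String spec) {
        Map<String, String> out = new LinkedHashMap<>();

        // Everything after the EXTRA keyword is free-form, so split it off first.
        String body = spec;
        int extraAt = spec.indexOf(" EXTRA");
        if (extraAt >= 0) {
            body = spec.substring(0, extraAt);
            out.put("EXTRA", spec.substring(extraAt + " EXTRA".length()).trim());
        }

        // Pad parentheses with spaces so that they become tokens of their own.
        String[] tokens = body.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        String section = "";   // OBSERVATIONS, ACTIONS or REWARDS
        String type = "";      // INTS or DOUBLES
        int i = 0;
        while (i < tokens.length) {
            String t = tokens[i];
            switch (t) {
                case "VERSION":
                case "PROBLEMTYPE":
                case "DISCOUNTFACTOR":
                    out.put(t, tokens[i + 1]);
                    i += 2;
                    break;
                case "OBSERVATIONS":
                case "ACTIONS":
                case "REWARDS":
                    section = t;
                    type = "";
                    i++;
                    break;
                case "INTS":
                case "DOUBLES":
                    type = t;
                    i++;
                    break;
                case "CHARCOUNT":
                    out.put(section + ".CHARCOUNT", tokens[i + 1]);
                    i += 2;
                    break;
                case "(": {
                    int close = i + 1;
                    while (!tokens[close].equals(")")) close++;
                    int width = close - i - 1;   // 2 = (min max), 3 = (repeat min max)
                    if (section.equals("REWARDS")) {
                        out.put("REWARDS", tokens[i + 1] + " " + tokens[i + 2]);
                    } else {
                        int repeat = (width == 3) ? Integer.parseInt(tokens[i + 1]) : 1;
                        String key = section + "." + type;
                        int soFar = Integer.parseInt(out.getOrDefault(key, "0"));
                        out.put(key, String.valueOf(soFar + repeat));
                    }
                    i = close + 1;
                    break;
                }
                default:
                    throw new IllegalArgumentException("Unexpected token: " + t);
            }
        }
        return out;
    }
}
```

Applied to the first example above, this sketch reports 3 integer observations, 3 double observations, a character count of 1024, and 1 integer action. The values are kept as strings only to keep the example short; a real parser would store typed ranges.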

It might seem like a hassle to create these task specs. We are adding helper code in some languages. In Java, it will look something like this:

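// Build the Mountain Car task spec from the last example above.
TaskSpecVRLGLUE3 sampleTSO=new TaskSpecVRLGLUE3();
sampleTSO.setEpisodic();
sampleTSO.setDiscountFactor(1);
// Observations: position in [-1.2, 0.5], then velocity in [-.07, .07].
sampleTSO.addDoubleObservation(new DoubleRange(-1.2, .5));
sampleTSO.addDoubleObservation(new DoubleRange(-.07,.07));
// One integer action with values {0,1,2}, matching ACTIONS INTS (0 2).
sampleTSO.addIntAction(new IntRange(0,2));
sampleTSO.setRewardRange(new DoubleRange(-1,0));
sampleTSO.setExtra("Name=Traditional-Mountain-Car Cutoff=None Random-Starts=True");

// Serialize the object to the task spec string.
String theTaskSpec = sampleTSO.toTaskSpec();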

You can also define your own task spec. For example, the real time strategy task spec had a dynamic number of actions and observations, so any standard task spec language would be woefully inadequate. In that case, do something like: VERSION Real-Time-Strategy-1.0 <then you can put whatever you want after, and the standard task spec parsers will know not to bother trying>
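
One way an agent might decide whether to hand a spec to a standard parser is to look only at the token after VERSION. The sketch below assumes that convention; the class TaskSpecVersion and its methods are hypothetical names, not part of RL-Glue.

```java
/** Hypothetical guard: decide whether a standard RL-Glue-3.0 parser should handle a spec. */
final class TaskSpecVersion {

    /** Returns the token after VERSION, or null if the spec does not start with VERSION. */
    static String versionOf(String spec) {
        // Only the first two tokens matter; the rest of the spec is left unsplit.
        String[] tokens = spec.trim().split("\\s+", 3);
        if (tokens.length < 2 || !tokens[0].equals("VERSION")) {
            return null;
        }
        return tokens[1];
    }

    /** True only for the standard version name given in the template. */
    static boolean isStandard(String spec) {
        return "RL-Glue-3.0".equals(versionOf(spec));
    }
}
```

An agent could call isStandard first and fall back to its own custom handling (or ignore the spec) when it returns false.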

Development Details

Currently under development. Some desiderata:

  • Easier to read by humans
  • Easier to parse
  • Extra area for unstructured side information
  • Add section for the environment name
  • Add ability for custom specs

Extra Considerations

Tetris: In the Tetris domain used in the 2008 RL-Competition, we wanted to specify game-board width and height. There was no way to do this with a standard task spec, so we used a hack of adding two observations at each step, the width and the height, which never changed. This is an ugly fix.

Real Time Strategy: For the RTS game used in the 2008 RL-Competition, the number of actions and observations was not fixed: it changed as the game progressed. The task spec sent across was bogus and ignored. There were some things that the environment wanted to tell the agent, but those ended up getting sent through the messaging system.

Quick'n'Dirty Proposal

  • Instead of numbering task spec versions: name them and encourage 3rd party specs

Official RL-Glue Spec:

  • Don't use colons, underscores, or commas. Just use white space.
  • Use round brackets for ranges
  • Don't allow parts of the spec to be optional/unspecified
  • Replace the idea of "unspecified bounds" with a special character. u?
  • Add an Xtra section at the end: freeform to the end of the string
  • Add a "name" field that will be used in conjunction with the Xtra section to know what metadata to expect
  • Label each section
  • Don't allow arbitrary ordering of ints, doubles, chars... make them group together.

So, instead of:

2.0:e:2_[f,f]_[-1.2,0.5]_[-.07,.07]:1_[i]_[0,2]:[-1,0]

We get:

version RL-Glue-3.0 problemtype episodic observations 0 int 2 double (-1.2 0.5) (-.07 .07) 0 char actions 1 int (0 2) 0 double 0 char reward (-1 0) name unspecified extra stuff!

And instead of:

2.0:e:2_[i,f]_[,]_[-inf,inf]:1_[i]_[0,2]:[-1,0]

We get:

version RL-Glue-3.0 problemtype episodic observations 1 int (u u) 1 double (-inf inf) 0 char actions 1 int (0 2) 0 double 0 char reward (-1 0) name unspecified extra

For Tetris, you could do:

version RL-Glue-3.0 problemtype episodic observations 200 int (0 1) (0 1) (0 1) /*200 of these*/ (0 1) 0 double 0 char actions 1 int (0 5) 0 double 0 char reward (0 16) name tetris-2.0 extra rows 10 cols 20

For RTS, you could do:

version RTS-1.0 <what follows here would be unconstrained and up to the RTS environment designer>


Why make such a big change? Linear parsing! You can literally just go through the task spec in a single pass, and read things in a linear fashion. There is no counting necessary, no checking for special cases. This should allow us to actually write and test all the task spec parsers in a couple of days.

Parser Constraints/Expectations

I propose that every task spec parser returns a struct/object from parsing the task spec, and that this struct/object can be turned back into the string. So you can write code like:

taskStruct = parseTaskSpec ( taskSpecString );
taskSpecStringCopy = serializeTaskStruct(taskStruct);

If we hold ourselves to the standard that before a parser is released, it can do those conversions, then we can come up with a few (hundred? :P) test task specs and make sure the parser can handle them all.
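
The round-trip standard described here could be checked with a small generic helper. The following Java sketch is an assumption about how such a test might look: RoundTripCheck, normalize, and roundTrips are hypothetical names, and the token-list "parser" is only a stand-in so the example runs on its own. A real test would plug in an actual task spec parser and serializer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical round-trip check: parse, serialize, and compare with the input. */
final class RoundTripCheck {

    /** Collapses runs of whitespace so formatting differences do not fail the check. */
    static String normalize(String spec) {
        return spec.trim().replaceAll("\\s+", " ");
    }

    /** True if serialize(parse(spec)) reproduces spec, ignoring whitespace differences. */
    static <T> boolean roundTrips(Function<String, T> parse,
                                  Function<T, String> serialize,
                                  String spec) {
        T struct = parse.apply(spec);
        return normalize(serialize.apply(struct)).equals(normalize(spec));
    }

    // Stand-in parser for the demo: the "struct" is just the list of tokens.
    static List<String> parseTokens(String spec) {
        return Arrays.asList(normalize(spec).split(" "));
    }

    // Stand-in serializer for the demo: joins the tokens back with single spaces.
    static String serializeTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }
}
```

Running roundTrips over a collection of test task specs, one call per spec, gives the kind of release gate proposed above.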

Task Spec 2.0 (Summer 2007)

Copied from: http://rlai.cs.ualberta.ca/RLBB/TaskSpecification.html
The ambition of this page is to present a specific proposal for a language for describing tasks -- agent-environment interfaces -- to be used as a Task_specification in calls from env_init and to agent_init in the RL-framework.

Task Description Language

In an effort to provide the agent writer with simple and concise information about the environment, a Task_specification is passed from the environment, through the interface, to the agent. The environment's init method (env_init) encodes information about the problem in an ASCII string. The string is then passed to the agent's init method (agent_init). This information can also be used to check that the agent and environment are suitable for each other. A few example Task_specifications are provided below.
The agent is responsible for parsing any relevant information out of the Task_specification in the init method. A generic Task_specification parsing function  is provided with RL-Glue 2.0 for all C/C++ users. This simple parser will return a structure containing the information encoded in the Task_specification, such as the Observation and Action dimensions, arrays of Observation and Action variable ranges, and arrays of Observation and Action variable types. More information about the parser and the parsed task spec struct can be found here.



Task_specification
The Task_specification is stored as a string with the following format:
        "V:E:O:A:R"
For example, this is a sample task_specification provided as one of the examples below:
        "2:e:1_[i]_[0,N-1]:1_[i]_[0,3]:[-1,0]"
The V corresponds to the version number of the task specification language. E corresponds to the type of task being solved. It has a character value of 'e' if the task is episodic and 'c' if the task is continuing. O and A correspond to Observation and Action information respectively. Finally, the R corresponds to the range of rewards for the task. Within each of O, A and R a range can be provided; if the values are unknown or infinite in magnitude, special input values are available, as described below.
The formats of O and A are identical, so we describe the form of O only. O contains three components, separated by underscore characters ("_"):
        #dimensions_dimensionTypes_dimensionRanges

#dimensions is an integer value specifying the number of dimensions in the Observation space. dimensionTypes is a list specifying the type of each dimension variable. The dimensionTypes list is composed of #dimensions components separated by comma characters (",") within square brackets ([x1,x2,x3,..., xn] where xi represents the ith value). Each comma-separated value in the list describes the type of values assigned to each Observation variable in the environment. In general, Observation variables can have one of the following 2 types:

        'i' - integer value
        'f' - float value

Thus a dimensionTypes list corresponding to an Observation space with 1 dimension has the following form:


[a] where a is an element of ['i','f']
  

So a dimensionTypes list with one integer value would be: [i]

An Observation space with 2 dimensions would have a dimensionTypes with the following form:


[a,b] where a and b are elements of ['i','f']


indicating the value type of the first (a) and second (b) Observation variables. Thus a three-dimensional Observation with a float dimension, then an integer dimension, then another float dimension would have the following dimensionTypes:
       [f,i,f]

The dimensionRanges is a list specifying the range of each dimension variable in the Observation space.  The dimensionRanges is composed of #dimensions components separated by underscore characters. Each dimensionRanges component specifies the upper and lower bound of values for each Observation variable. If the bounds are unknown or unspecified, you can leave an empty space in the place of a value. If the bounds are positive or negative infinity, you can use inf or -inf to represent your range. These can be used in combination. For example one valid range could be an unknown lower bound and infinite upper bound, or a lower bound of -inf and an upper bound of 1. You can be as precise (though you must be accurate) as you wish. A dimensionRanges corresponding to an Observation space with a single dimension variable would have the following form:

[O1MIN, O1MAX]

So a dimensionRanges list for a binary dimension variable would be: [0,1]

A dimensionRanges list for a variable with both bounds unspecified would be: [] or [,] (both are valid).
A dimensionRanges list with one or two unbounded values can use inf or -inf, e.g. [0, inf] or [-inf,1] or [-inf,inf].
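
The range forms above (numeric bounds, blank bounds, and inf or -inf) can be handled by one small routine. The Java sketch below is an illustration, not the C/C++ parser shipped with RL-Glue 2.0: the class name Range2 is hypothetical, and representing an unspecified bound as NaN is an assumption made for this example.

```java
/** Hypothetical parser for one Task Spec 2.0 range such as "[0,1]", "[,]", "[]" or "[-inf,1]". */
final class Range2 {
    final double min;   // NaN when the lower bound is left blank (unspecified)
    final double max;   // NaN when the upper bound is left blank (unspecified)

    Range2(double min, double max) {
        this.min = min;
        this.max = max;
    }

    static Range2 parse(String text) {
        String t = text.trim();
        if (!t.startsWith("[") || !t.endsWith("]")) {
            throw new IllegalArgumentException("Not a range: " + text);
        }
        String inside = t.substring(1, t.length() - 1).trim();
        if (inside.isEmpty()) {
            return new Range2(Double.NaN, Double.NaN);   // "[]": both bounds unspecified
        }
        // A limit of -1 keeps trailing empty strings, so "[0,]" still yields two parts.
        String[] parts = inside.split(",", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected [min,max], got: " + text);
        }
        return new Range2(bound(parts[0]), bound(parts[1]));
    }

    /** Converts one bound: blank means unspecified, inf and -inf mean unbounded. */
    static double bound(String token) {
        String s = token.trim();
        if (s.isEmpty()) return Double.NaN;
        if (s.equals("inf")) return Double.POSITIVE_INFINITY;
        if (s.equals("-inf")) return Double.NEGATIVE_INFINITY;
        return Double.parseDouble(s);
    }
}
```

Callers can use Double.isNaN to detect an unspecified bound and Double.isInfinite to detect an unbounded one.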

An Observation space with 2 dimensions would have a dimensionRanges with the following form:
        [O1MIN, O1MAX]_[O2MIN, O2MAX]
indicating the minimal and maximal values of Observation variables O1 and O2 respectively. This definition can be then trivially extended to Observation spaces with N dimensions. 

NOTE: the dimensionRanges of an Observation space with 1 or more unbounded values may not be representable in this way. An unbounded value has no minimal or maximal range. Thus, we simply do not specify the range in the dimensionRanges for any Observation variables with unbounded values. For example, consider a problem with 3 Observation dimensions. The first and third Observation variables have interval values and the second has an unbounded ratio value. The corresponding dimensionRanges for this problem is encoded as:

       [O1MIN, O1MAX]_[,]_[O3MIN, O3MAX]
indicating the minimal and maximal values of Observation variables O1 and O3.

The format of A (Action space information) is identical to that of O (Observation space information) and thus the definitions above hold for Action spaces.
Lastly, the R (Reward space information) is merely a range specifier. By the Reward Hypothesis there is only ever one reward signal (which in RL-Glue is always a floating point number), so the #dimensions and dimensionTypes information becomes meaningless. The reward range can again be specified to be unknown or infinite in the same manner as the Observation ranges. A rewardRange has the following form:
    
    [rewardMin, rewardMax]

For a task whose rewards all lie between -1 and 0 (inclusive), the rewardRange would be: [-1,0].

If no lower bound is known and the upper bound is positive infinity, the rewardRange would be: [,inf]
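
Putting the pieces of the format together, a "V:E:O:A:R" string can be taken apart with two levels of splitting: colons between the five fields, then underscores inside O and A. The Java sketch below illustrates this under the assumption that the spec is well formed; TaskSpec2Splitter and its method names are hypothetical and do not correspond to the parser distributed with RL-Glue 2.0.

```java
import java.util.Arrays;

/** Hypothetical splitter for a Task Spec 2.0 string of the form V:E:O:A:R. */
final class TaskSpec2Splitter {

    /** Returns the five top-level fields: version, task type, O, A and R. */
    static String[] fields(String spec) {
        String[] parts = spec.trim().split(":");
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected V:E:O:A:R, got: " + spec);
        }
        return parts;
    }

    /** Number of dimensions declared at the start of an O or A field. */
    static int dimensions(String field) {
        return Integer.parseInt(field.split("_")[0]);
    }

    /** Type letters ("i" or "f") from the bracketed, comma-separated type list. */
    static String[] types(String field) {
        String list = field.split("_")[1];
        return list.substring(1, list.length() - 1).split(",");
    }

    /** One bracketed range per dimension, in order. */
    static String[] ranges(String field) {
        String[] parts = field.split("_");
        return Arrays.copyOfRange(parts, 2, parts.length);
    }
}
```

For example, the O field 2_[f,f]_[-1.2,0.5]_[-.07,.07] yields 2 dimensions, the types f and f, and the two ranges [-1.2,0.5] and [-.07,.07].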



Example Task_specifications
Consider a simple gridworld with Actions North, South, East and West and a single-dimension Observation of grid position. If we encode actions as 0, 1, 2, 3 and position as an integer between 0 and N-1, we get the following Task_specification:
        "2:e:1_[i]_[0,N-1]:1_[i]_[0,3]:[-1,0]"
This Task_specification provides the following information:
        - RL-Glue version 2.0 supported
        - the task is episodic
        - Observation space has one dimension
        - the Observation variable has integer values (discrete state)
        - range of Observation variable is 0 to N-1
        - Action space has one dimension
        - the Action variable has integer values (discrete actions >> tabular)
        - range of Action variable is 0 to 3
        - range of the rewards is -1 to 0
For a more complex illustration of the expressiveness of the Task_specification language, consider the Mountain Car problem. The Actions available to the agent are full throttle reverse, zero throttle, and full throttle forward. The Observation consists of the car's position and velocity. If we encode Actions as 0, 1, and 2 respectively and position and velocity as real values with finite ranges, we get the following Task_specification:
        "2.0:e:2_[f,f]_[-1.2,0.5]_[-.07,.07]:1_[i]_[0,2]:[-1,0]"
This Task_specification provides the following information:
        - RL-Glue version 2 supported
        - the task is episodic
        - Observation space has two dimensions
        - the first Observation variable has float values
        - the second Observation variable has float values (2D continuous state)
        - range of Observation variable one is -1.2 to 0.5
        - range of Observation variable two is -0.07 to 0.07
        - Action space has one dimension
        - the Action variable has integer values (discrete actions)
        - range of Action variable is 0 to 2
        - range of the Rewards is -1 to 0


A final example from a fabricated problem:
        "2.0:e:2_[i,f]_[,]_[-inf,inf]:1_[i]_[0,2]:[-1,0]"
This Task_specification provides the following information:
  • RL-Glue version 2 supported
  • the task is episodic
  • Observation space has two dimensions
  • the first Observation variable has integer values
  • the second Observation variable has float values
  • range of Observation variable one is unspecified
  • range of Observation variable two is unbounded
  • Action space has one dimension
  • the Action variable has integer values (discrete actions)
  • range of Action variable is 0 to 2
  • range of the Rewards is -1 to 0


Task Spec 1.0

Hopefully we can dig this up, for the memories :)