Re: [SpeechIO-48] string substitution (was: Re: [SpeechIO-12] speechdv0.39)



> 
> Ewww.  Thank you.  Ugh.  
> 
> Wait, you're saying that
> 
>  $text =~ s/([a-zA-Z_0-9\/]+)/$wordsub{$1} || $1/eg; 
> 
> Has the same problem ?  I don't think it does.. it's doing a hash lookup,
> not a search.  It'll take any contiguous group of characters in the above
> set (in []'s), and do a hash lookup.
> 
> if $wordsub{imo} eq 'in my opinion', it'll get to "animosity", grab the
> whole thing at once (because it's composed entirely of characters in the
> set we're looking for) and nothing just before or after it (because that
> would be whitespace, which is not in the set), then do a hash lookup, and
> since $wordsub{animosity} is undefined, it'll hit the || $1 and leave it
> alone.
> 
> Right ?

Yeah, I forgot that '+' was greedy, so it should work just fine.  Oops, my
perl must be a bit rusty...
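
A minimal sketch of the behaviour discussed above, assuming a hypothetical
%wordsub table with a single entry (the real entries would come from
speechd.sub, whose format isn't shown here) and a made-up helper name,
expand_words.  The '/' inside the character class is escaped because the
substitution itself uses '/' as its delimiter:

```perl
use strict;
use warnings;

# Hypothetical substitution table; real entries would come from speechd.sub.
my %wordsub = (imo => 'in my opinion');

# Hypothetical helper name, used only for this illustration.
sub expand_words {
    my ($text) = @_;
    # '+' is greedy, so each run of word characters is captured whole and
    # looked up as a single hash key; words with no entry are left alone.
    $text =~ s/([a-zA-Z_0-9\/]+)/$wordsub{$1} || $1/eg;
    return $text;
}

print expand_words("imo there is no animosity here"), "\n";
# prints: in my opinion there is no animosity here
```

"animosity" is captured as one token, so the "imo" inside it is never
looked up on its own.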

> Then we'd just have to decide which characters to consider parts of words
> ("/", etc), and which characters not to (periods, commas, quotes, parens,
> etc).
> 
> Perl is really impressive.  It's cool how many rather elegant
> possibilities we have.  The frustration is in the irregularity of our
> language :)
> If punctuation were always just punctuation, and words were always made up
> only of alphabetic characters, this'd be easy :)

'/' isn't used as punctuation, right?  The only characters that are
punctuation are [.,;:?], right?  Is that what we can break on?

Something like a negated character class, for example:
  
  ([^.,;:? \t]+)

Would this be closer to what we want?  Hmm... this also matches control
characters and binary data, so it's probably not exactly what we want...
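
To make that concern concrete, here is a rough sketch that tokenizes a
sample string with the negated class above.  The helper name tokens and
the sample text are made up for illustration.  Quotes and parentheses stay
glued to the words, and a stray control byte passes straight through:

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of characters that is not one of
# the break characters from the negated class in the message.
sub tokens {
    my ($text) = @_;
    return $text =~ /([^.,;:? \t]+)/g;
}

# Sample input with quotes, parentheses, and a control byte (\x01).
my @t = tokens(qq{Well, "imo" (really)\x01 fine.});

for my $tok (@t) {
    # Show non-printable bytes as hex escapes so they are visible.
    (my $shown = $tok) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    print "[$shown]\n";
}
```

A hash lookup on the token `"imo"` (quotes included) would miss an entry
stored under plain imo, so quotes and parens would probably need to be
added to the break set as well.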

> I did actually think about this before I put it in speechd.  I think this
> function deals with an issue that will be common for all applications
> that use speechd.  And it can be disabled (by removing the speechd.sub
> file).
>
> Read over it.  Looks neat.  Still don't fully understand it though.
> 
> Didn't think about that.  The only reason I was thinking they should be
> stripped is to avoid like... what's that stuff called ?  Well, tainting
> problems.  Is that not the issue that I think it might be ?

If speechd is just a conduit, then we don't have to concern ourselves with
tainting problems -- as soon as we start filtering the data, then we have
to worry about tainting issues.  Multibyte would allow foreign character
sets to pass through speechd -- if someone has written (or does in the future)
a Festival-like server that speaks French, then we'd already support it
if we pass multibyte characters through.  Perl handles those just fine by
itself: \w can match multibyte word characters, not just ASCII ones...
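
A small sketch of the \w point, with some assumptions: the input is UTF-8,
the core Encode module is available, and the /u regex flag is supported
(Perl 5.14 or later).  Whether \w treats an accented letter as a word
character depends on whether the text has been decoded into characters
before matching:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "café" as raw UTF-8 bytes, as it might arrive from a client.
my $bytes = "caf\xc3\xa9";

# The same text decoded into Perl characters.
my $chars = decode('UTF-8', $bytes);

# /u asks for Unicode matching rules in both cases.
my $bytes_is_word = ($bytes =~ /^\w+$/u) ? 1 : 0;
my $chars_is_word = ($chars =~ /^\w+$/u) ? 1 : 0;

print "raw bytes: $bytes_is_word, decoded: $chars_is_word\n";
# prints: raw bytes: 0, decoded: 1
```

So if speechd only passes the bytes through, nothing special is needed; the
decoding step only matters once something actually matches against the text.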

But if we move this stuff to catspeech, then speechd doesn't have to worry
about it (or use the CPU to deal with it).  catspeech would, and that's
not run by root, so it'd be friendlier to the system as a whole.

> I'm very happy to have you back :)

thanks, I just hope my contributions are worthwhile and don't detract from
where you're trying to take things.

k

------------------------------------------------------------------------------
"We are all born originals -- why is it so many of us die copies?" 
    -- Edward Young
mortis@voicenet.com                            http://www.voicenet.com/~mortis
------------------------------------------------------------------------------