My Math Forum  

Go Back   My Math Forum > Science Forums > Computer Science

Computer Science Computer Science Forum


Reply
 
LinkBack Thread Tools Display Modes
February 18th, 2014, 02:17 PM   #11
Global Moderator
 
CRGreathouse's Avatar
 
Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Ah, you're using it as a pre-compressor, not a compressor. Sure -- throwing away any information should make it easier to compress.
CRGreathouse is offline  
 
February 21st, 2014, 06:19 PM   #12
Banned Camp
 
Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
Originally Posted by CRGreathouse
Ah, you're using it as a pre-compressor, not a compressor. Sure -- throwing away any information should make it easier to compress.
I did not say it was compressor.
But writing only the necessary letters you throw no information at all unless you did not understand what I meant.
My idea is probably a new idea never seen before.
People are screaming victory when they win 1% in pre-compressing the text.
I can assure you that with the 2 steps I suggested above : necessary letters and optimization strategy (context and other considerations) the text data compression will gain at least 10% maybe more.
Try to google data text pre-compression to understand what I mean.
I studied long time almost all the compression algorithms and I went to the conclusion that even random strings could be compressed. All depends on how you can regenerate information from few because there is always redondancy (there is redondancy on structures (as an example the syntax) sentences are ordered in some way which occur frequently. In any language you use there is redondancy.
Even random lottery draws have some redondancy.
mobel is offline  
February 22nd, 2014, 09:48 AM   #13
Global Moderator
 
CRGreathouse's Avatar
 
Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Quote:
Originally Posted by mobel
But writing only the necessary letters you throw no information at all unless you did not understand what I meant.
You wrote: "In reality if we use a dictionary we will surely 80% of the stars right and 20% wrong." which suggests that you are throwing away information.
CRGreathouse is offline  
February 22nd, 2014, 10:40 AM   #14
Banned Camp
 
Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
Originally Posted by CRGreathouse
Quote:
Originally Posted by mobel
But writing only the necessary letters you throw no information at all unless you did not understand what I meant.
You wrote: "In reality if we use a dictionary we will surely 80% of the stars right and 20% wrong." which suggests that you are throwing away information.
You did not understand what I meant.
To make a simulation I did something very simple and easy : I suppressed all the vowels considering that if we use a dictionary of necessary letters 20% of the vowels I suppressed are wrong. That is just an estimation.
Because when we use a dictionary of necessary letters even consonants are removed (that`s the job of the linguist to do it).
Examples in french :
the word "chien" we try to find some word such as if we replace some letters by stars (the stars are the x`s ), each star has only one solution.
chien = *hi**n
What is the meaning of that : when the 6 letters words is read by the program it will translate it as chien because no other 6 word letter have other solution than chien.
chion, phion, phien, phign, and so on do not exist in french language.
It is like when you play crosswords. Sometimes you do not need even to see the definition to write the right word.

Ps : I choose the example chien I did not think long time to do it. If there is any mistake from my part it will change the sense of what I`m suggesting.
No information is thrown. I can give 100`s of examples where we could replace letters by stars without loss of information.
mobel is offline  
February 22nd, 2014, 02:17 PM   #15
Global Moderator
 
CRGreathouse's Avatar
 
Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Quote:
Originally Posted by mobel
No information is thrown. I can give 100`s of examples where we could replace letters by stars without loss of information.
Would you give a dozen or two?
CRGreathouse is offline  
February 23rd, 2014, 04:47 AM   #16
Banned Camp
 
Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
Originally Posted by CRGreathouse
Quote:
Originally Posted by mobel
No information is thrown. I can give 100`s of examples where we could replace letters by stars without loss of information.
Would you give a dozen or two?
Here you can find a french " parametrical dictionary".
You enter the length of any word and the word itself with ? instead of * and you will have the answer (either one or more solution).
I entered some words used frequently (see here : http://fr.wiktionary.org/wiki/Wiktionna ... s_courants )

Examples :
At each word entered there is ALWAYS one solution

5 word letters : stylo = *t*lo
11 word letters : aujourd`hui = auj********

and so on.

I stopped because the website allow only some tries unless you register.

Here is how you can build a dictionary of necessary letters in french or any other language. You do not need to know french to do it.

First step : Data base building

1.Take an exhaustive list of all the french words
2. Sort the list by length (from 2 to k) assuming that k is the biggest length.
3. Create files by length (from 2 to k). It will help when searching. You do not need to search all the words.

Second : Necessary letters dictionary building

Each word of length l will require (2^l)-2 searches

Example of 5 letter word : xxxxx

Search : 30 cases

star is unknown variable
x is known letter

*xxxx
x*xxx
xx*xx
xxx*x
xxxx*

**xxx
*x*xx
*xx*x
*xxx*

x**xx
x*x*x

and so on from 1 star to 2 stars to 3 stars to 4 stars

The best strategy here is to start from 4 stars, then 3, then 2, then 1. If you find at some stage only one solution then the word is definitively accepted.

Example :
6 word letters : coccyx = c***yx (so in this case your search will stop at three)

You have to do what I described above for all the french words to build your dictionary of necessary letters.

So if you intent to send a software of pre-compression any user have to store the dictionary in his computer. You do not need to send the dictionary with

each compression. It is obvious. More than that any compression sofware can use such dictionary.

Building such dictionary will take you at most on day work.
If you build it then try to test some long text to see how many percentage you could gain.
If the gain is significative you could launch a start-up soon by developping it for other languages.
"Kickstarter" will surely help you to find a financial ressources. I will be happy to contribute anonymously.
Anyway I will be curious to know after simulation how many percentage you gain.
My idea works only for data text compression.
mobel is offline  
February 23rd, 2014, 07:20 AM   #17
Global Moderator
 
CRGreathouse's Avatar
 
Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Yes, you can get some compression that way, as long as the message includes only words from that dictionary. But the methods I gave toward the start of the thread give more compression, essentially by taking the idea further.
CRGreathouse is offline  
February 23rd, 2014, 09:06 AM   #18
Banned Camp
 
Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
Originally Posted by CRGreathouse
Yes, you can get some compression that way, as long as the message includes only words from that dictionary. But the methods I gave toward the start of the thread give more compression, essentially by taking the idea further.
I used gzip online to compress the 2 texts :

http://i-tools.org/gzip

With stars

494 bytes

Without stars

554 bytes

So the original text (1017 bytes) was compressed to 554 bytes (54.4%) and the new text with stars to 494 bytes (48.6%).
The is equal to 5.8% which is significative even if the text is short.
Imagine what the gain will be if you compress a text of one gigabyte.

My feeling is that you still did not get my idea.

Thank you for commenting.

Ps : I have another idea (a big revolution in pre compressing text) but too hard and too long to explain. I will keep it for my brother who is a high level programmer (he lives in Switzerland).
mobel is offline  
February 23rd, 2014, 09:25 AM   #19
Banned Camp
 
Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

An interesting article about pre compression :

http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
mobel is offline  
February 23rd, 2014, 12:15 PM   #20
Global Moderator
 
CRGreathouse's Avatar
 
Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Quote:
Originally Posted by mobel
So the original text (1017 bytes) was compressed to 554 bytes (54.4%) and the new text with stars to 494 bytes (48.6%).
The is equal to 5.8% which is significative even if the text is short.
Imagine what the gain will be if you compress a text of one gigabyte.
Much, much less than the state of the art. See
http://mattmahoney.net/dc/text.html
CRGreathouse is offline  
Reply

  My Math Forum > Science Forums > Computer Science

Tags
compression, data, text



Thread Tools
Display Modes


Similar Threads
Thread Thread Starter Forum Replies Last Post
Data Compression through SVD philip Linear Algebra 0 June 22nd, 2013 04:13 AM
Rolling 3 months YOY data to Monthly YOY data lumpa Real Analysis 0 October 19th, 2012 08:07 AM
Matrix Compression and Comparison Hamid Behravan Linear Algebra 0 July 11th, 2011 07:40 AM
Compression technique for short number strings! myrv Algebra 6 January 30th, 2010 05:11 PM
Data Compression with Wavelets aminay Math Books 0 December 12th, 2009 05:23 AM





Copyright © 2019 My Math Forum. All rights reserved.