My Math Forum Data text compression

 Computer Science Computer Science Forum

 February 18th, 2014, 02:17 PM #11 Global Moderator     Joined: Nov 2006 From: UTC -5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms Re: Data text compression Ah, you're using it as a pre-compressor, not a compressor. Sure -- throwing away any information should make it easier to compress.
February 21st, 2014, 06:19 PM   #12
Banned Camp

Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
 Originally Posted by CRGreathouse Ah, you're using it as a pre-compressor, not a compressor. Sure -- throwing away any information should make it easier to compress.
I did not say it was compressor.
But writing only the necessary letters you throw no information at all unless you did not understand what I meant.
My idea is probably a new idea never seen before.
People are screaming victory when they win 1% in pre-compressing the text.
I can assure you that with the 2 steps I suggested above : necessary letters and optimization strategy (context and other considerations) the text data compression will gain at least 10% maybe more.
Try to google data text pre-compression to understand what I mean.
I studied long time almost all the compression algorithms and I went to the conclusion that even random strings could be compressed. All depends on how you can regenerate information from few because there is always redondancy (there is redondancy on structures (as an example the syntax) sentences are ordered in some way which occur frequently. In any language you use there is redondancy.
Even random lottery draws have some redondancy.

February 22nd, 2014, 09:48 AM   #13
Global Moderator

Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Quote:
 Originally Posted by mobel But writing only the necessary letters you throw no information at all unless you did not understand what I meant.
You wrote: "In reality if we use a dictionary we will surely 80% of the stars right and 20% wrong." which suggests that you are throwing away information.

February 22nd, 2014, 10:40 AM   #14
Banned Camp

Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
Originally Posted by CRGreathouse
Quote:
 Originally Posted by mobel But writing only the necessary letters you throw no information at all unless you did not understand what I meant.
You wrote: "In reality if we use a dictionary we will surely 80% of the stars right and 20% wrong." which suggests that you are throwing away information.
You did not understand what I meant.
To make a simulation I did something very simple and easy : I suppressed all the vowels considering that if we use a dictionary of necessary letters 20% of the vowels I suppressed are wrong. That is just an estimation.
Because when we use a dictionary of necessary letters even consonants are removed (thats the job of the linguist to do it).
Examples in french :
the word "chien" we try to find some word such as if we replace some letters by stars (the stars are the xs ), each star has only one solution.
chien = *hi**n
What is the meaning of that : when the 6 letters words is read by the program it will translate it as chien because no other 6 word letter have other solution than chien.
chion, phion, phien, phign, and so on do not exist in french language.
It is like when you play crosswords. Sometimes you do not need even to see the definition to write the right word.

Ps : I choose the example chien I did not think long time to do it. If there is any mistake from my part it will change the sense of what Im suggesting.
No information is thrown. I can give 100s of examples where we could replace letters by stars without loss of information.

February 22nd, 2014, 02:17 PM   #15
Global Moderator

Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Quote:
 Originally Posted by mobel No information is thrown. I can give 100s of examples where we could replace letters by stars without loss of information.
Would you give a dozen or two?

February 23rd, 2014, 04:47 AM   #16
Banned Camp

Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
Originally Posted by CRGreathouse
Quote:
 Originally Posted by mobel No information is thrown. I can give 100s of examples where we could replace letters by stars without loss of information.
Would you give a dozen or two?
Here you can find a french " parametrical dictionary".
You enter the length of any word and the word itself with ? instead of * and you will have the answer (either one or more solution).
I entered some words used frequently (see here : http://fr.wiktionary.org/wiki/Wiktionna ... s_courants )

Examples :
At each word entered there is ALWAYS one solution

5 word letters : stylo = *t*lo
11 word letters : aujourd`hui = auj********

and so on.

I stopped because the website allow only some tries unless you register.

Here is how you can build a dictionary of necessary letters in french or any other language. You do not need to know french to do it.

First step : Data base building

1.Take an exhaustive list of all the french words
2. Sort the list by length (from 2 to k) assuming that k is the biggest length.
3. Create files by length (from 2 to k). It will help when searching. You do not need to search all the words.

Second : Necessary letters dictionary building

Each word of length l will require (2^l)-2 searches

Example of 5 letter word : xxxxx

Search : 30 cases

star is unknown variable
x is known letter

*xxxx
x*xxx
xx*xx
xxx*x
xxxx*

**xxx
*x*xx
*xx*x
*xxx*

x**xx
x*x*x

and so on from 1 star to 2 stars to 3 stars to 4 stars

The best strategy here is to start from 4 stars, then 3, then 2, then 1. If you find at some stage only one solution then the word is definitively accepted.

Example :
6 word letters : coccyx = c***yx (so in this case your search will stop at three)

You have to do what I described above for all the french words to build your dictionary of necessary letters.

So if you intent to send a software of pre-compression any user have to store the dictionary in his computer. You do not need to send the dictionary with

each compression. It is obvious. More than that any compression sofware can use such dictionary.

Building such dictionary will take you at most on day work.
If you build it then try to test some long text to see how many percentage you could gain.
If the gain is significative you could launch a start-up soon by developping it for other languages.
"Kickstarter" will surely help you to find a financial ressources. I will be happy to contribute anonymously.
Anyway I will be curious to know after simulation how many percentage you gain.
My idea works only for data text compression.

 February 23rd, 2014, 07:20 AM #17 Global Moderator     Joined: Nov 2006 From: UTC -5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms Re: Data text compression Yes, you can get some compression that way, as long as the message includes only words from that dictionary. But the methods I gave toward the start of the thread give more compression, essentially by taking the idea further.
February 23rd, 2014, 09:06 AM   #18
Banned Camp

Joined: Dec 2013

Posts: 1,117
Thanks: 41

Re: Data text compression

Quote:
 Originally Posted by CRGreathouse Yes, you can get some compression that way, as long as the message includes only words from that dictionary. But the methods I gave toward the start of the thread give more compression, essentially by taking the idea further.
I used gzip online to compress the 2 texts :

http://i-tools.org/gzip

With stars

494 bytes

Without stars

554 bytes

So the original text (1017 bytes) was compressed to 554 bytes (54.4%) and the new text with stars to 494 bytes (48.6%).
The is equal to 5.8% which is significative even if the text is short.
Imagine what the gain will be if you compress a text of one gigabyte.

My feeling is that you still did not get my idea.

Thank you for commenting.

Ps : I have another idea (a big revolution in pre compressing text) but too hard and too long to explain. I will keep it for my brother who is a high level programmer (he lives in Switzerland).

 February 23rd, 2014, 09:25 AM #19 Banned Camp   Joined: Dec 2013 Posts: 1,117 Thanks: 41 Re: Data text compression An interesting article about pre compression : http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
February 23rd, 2014, 12:15 PM   #20
Global Moderator

Joined: Nov 2006
From: UTC -5

Posts: 16,046
Thanks: 938

Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms
Re: Data text compression

Quote:
 Originally Posted by mobel So the original text (1017 bytes) was compressed to 554 bytes (54.4%) and the new text with stars to 494 bytes (48.6%). The is equal to 5.8% which is significative even if the text is short. Imagine what the gain will be if you compress a text of one gigabyte.
Much, much less than the state of the art. See
http://mattmahoney.net/dc/text.html

 Tags compression, data, text

 Thread Tools Display Modes Linear Mode

 Similar Threads Thread Thread Starter Forum Replies Last Post philip Linear Algebra 0 June 22nd, 2013 04:13 AM lumpa Real Analysis 0 October 19th, 2012 08:07 AM Hamid Behravan Linear Algebra 0 July 11th, 2011 07:40 AM myrv Algebra 6 January 30th, 2010 05:11 PM aminay Math Books 0 December 12th, 2009 05:23 AM

 Contact - Home - Forums - Cryptocurrency Forum - Top