February 18th, 2014, 02:17 PM  #11 
Global Moderator Joined: Nov 2006 From: UTC 5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms  Re: Data text compression
Ah, you're using it as a precompressor, not a compressor. Sure  throwing away any information should make it easier to compress.

February 21st, 2014, 06:19 PM  #12  
Banned Camp Joined: Dec 2013 Posts: 1,117 Thanks: 41  Re: Data text compression Quote:
But writing only the necessary letters you throw no information at all unless you did not understand what I meant. My idea is probably a new idea never seen before. People are screaming victory when they win 1% in precompressing the text. I can assure you that with the 2 steps I suggested above : necessary letters and optimization strategy (context and other considerations) the text data compression will gain at least 10% maybe more. Try to google data text precompression to understand what I mean. I studied long time almost all the compression algorithms and I went to the conclusion that even random strings could be compressed. All depends on how you can regenerate information from few because there is always redondancy (there is redondancy on structures (as an example the syntax) sentences are ordered in some way which occur frequently. In any language you use there is redondancy. Even random lottery draws have some redondancy.  
February 22nd, 2014, 09:48 AM  #13  
Global Moderator Joined: Nov 2006 From: UTC 5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms  Re: Data text compression Quote:
 
February 22nd, 2014, 10:40 AM  #14  
Banned Camp Joined: Dec 2013 Posts: 1,117 Thanks: 41  Re: Data text compression Quote:
To make a simulation I did something very simple and easy : I suppressed all the vowels considering that if we use a dictionary of necessary letters 20% of the vowels I suppressed are wrong. That is just an estimation. Because when we use a dictionary of necessary letters even consonants are removed (that`s the job of the linguist to do it). Examples in french : the word "chien" we try to find some word such as if we replace some letters by stars (the stars are the x`s ), each star has only one solution. chien = *hi**n What is the meaning of that : when the 6 letters words is read by the program it will translate it as chien because no other 6 word letter have other solution than chien. chion, phion, phien, phign, and so on do not exist in french language. It is like when you play crosswords. Sometimes you do not need even to see the definition to write the right word. Ps : I choose the example chien I did not think long time to do it. If there is any mistake from my part it will change the sense of what I`m suggesting. No information is thrown. I can give 100`s of examples where we could replace letters by stars without loss of information.  
February 22nd, 2014, 02:17 PM  #15  
Global Moderator Joined: Nov 2006 From: UTC 5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms  Re: Data text compression Quote:
 
February 23rd, 2014, 04:47 AM  #16  
Banned Camp Joined: Dec 2013 Posts: 1,117 Thanks: 41  Re: Data text compression Quote:
You enter the length of any word and the word itself with ? instead of * and you will have the answer (either one or more solution). I entered some words used frequently (see here : http://fr.wiktionary.org/wiki/Wiktionna ... s_courants ) Examples : At each word entered there is ALWAYS one solution 5 word letters : stylo = *t*lo 11 word letters : aujourd`hui = auj******** and so on. I stopped because the website allow only some tries unless you register. Here is how you can build a dictionary of necessary letters in french or any other language. You do not need to know french to do it. First step : Data base building 1.Take an exhaustive list of all the french words 2. Sort the list by length (from 2 to k) assuming that k is the biggest length. 3. Create files by length (from 2 to k). It will help when searching. You do not need to search all the words. Second : Necessary letters dictionary building Each word of length l will require (2^l)2 searches Example of 5 letter word : xxxxx Search : 30 cases star is unknown variable x is known letter *xxxx x*xxx xx*xx xxx*x xxxx* **xxx *x*xx *xx*x *xxx* x**xx x*x*x and so on from 1 star to 2 stars to 3 stars to 4 stars The best strategy here is to start from 4 stars, then 3, then 2, then 1. If you find at some stage only one solution then the word is definitively accepted. Example : 6 word letters : coccyx = c***yx (so in this case your search will stop at three) You have to do what I described above for all the french words to build your dictionary of necessary letters. So if you intent to send a software of precompression any user have to store the dictionary in his computer. You do not need to send the dictionary with each compression. It is obvious. More than that any compression sofware can use such dictionary. Building such dictionary will take you at most on day work. If you build it then try to test some long text to see how many percentage you could gain. If the gain is significative you could launch a startup soon by developping it for other languages. "Kickstarter" will surely help you to find a financial ressources. I will be happy to contribute anonymously. Anyway I will be curious to know after simulation how many percentage you gain. My idea works only for data text compression.  
February 23rd, 2014, 07:20 AM  #17 
Global Moderator Joined: Nov 2006 From: UTC 5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms  Re: Data text compression
Yes, you can get some compression that way, as long as the message includes only words from that dictionary. But the methods I gave toward the start of the thread give more compression, essentially by taking the idea further.

February 23rd, 2014, 09:06 AM  #18  
Banned Camp Joined: Dec 2013 Posts: 1,117 Thanks: 41  Re: Data text compression Quote:
http://itools.org/gzip With stars 494 bytes Without stars 554 bytes So the original text (1017 bytes) was compressed to 554 bytes (54.4%) and the new text with stars to 494 bytes (48.6%). The is equal to 5.8% which is significative even if the text is short. Imagine what the gain will be if you compress a text of one gigabyte. My feeling is that you still did not get my idea. Thank you for commenting. Ps : I have another idea (a big revolution in pre compressing text) but too hard and too long to explain. I will keep it for my brother who is a high level programmer (he lives in Switzerland).  
February 23rd, 2014, 09:25 AM  #19 
Banned Camp Joined: Dec 2013 Posts: 1,117 Thanks: 41  Re: Data text compression
An interesting article about pre compression : http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf 
February 23rd, 2014, 12:15 PM  #20  
Global Moderator Joined: Nov 2006 From: UTC 5 Posts: 16,046 Thanks: 938 Math Focus: Number theory, computational mathematics, combinatorics, FOM, symbolic logic, TCS, algorithms  Re: Data text compression Quote:
http://mattmahoney.net/dc/text.html  

Tags 
compression, data, text 
Thread Tools  
Display Modes  

Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Data Compression through SVD  philip  Linear Algebra  0  June 22nd, 2013 04:13 AM 
Rolling 3 months YOY data to Monthly YOY data  lumpa  Real Analysis  0  October 19th, 2012 08:07 AM 
Matrix Compression and Comparison  Hamid Behravan  Linear Algebra  0  July 11th, 2011 07:40 AM 
Compression technique for short number strings!  myrv  Algebra  6  January 30th, 2010 05:11 PM 
Data Compression with Wavelets  aminay  Math Books  0  December 12th, 2009 05:23 AM 