Many systems and applications handle large volumes of duplicate items. Because real-world datasets are highly redundant, data deduplication can reduce the storage capacity required and improve the utilization of network bandwidth. However, existing deduplication systems operate on chunks ranging in size from 4 KB to over 16 KB, so they are not applicable to datasets consisting of short records. In this paper, we propose a new framework called SF-Dedup, which performs deduplication on large sets of Mobile Internet records whose size can be smaller than 100 B, or even smaller than 10 B. SF-Dedup is a short-fingerprint, in-line deduplication scheme that resolves hash collisions. Experimental results show that SF-Dedup reduces the storage capacity required and shortens query time on a relational database.
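
The abstract does not describe the internals of SF-Dedup, so the following is only a minimal sketch of how a short-fingerprint, in-line scheme with collision resolution could be organized; it is not the authors' design. The class name `ShortFingerprintStore`, the `fingerprint_bytes` parameter, the bucket layout, and the use of BLAKE2b are all assumptions made for illustration. Each record is looked up at write time (in-line) under a fingerprint of only a few bytes, and records that share a fingerprint are told apart by comparing their full contents.

```python
import hashlib


class ShortFingerprintStore:
    """Toy in-line deduplicating store keyed by a short fingerprint.

    Hypothetical illustration only; not the SF-Dedup implementation.
    """

    def __init__(self, fingerprint_bytes=2):
        # Assumed parameter: length of the fingerprint in bytes.
        self.fingerprint_bytes = fingerprint_bytes
        # Maps fingerprint -> list of distinct records sharing that fingerprint.
        self.buckets = {}

    def fingerprint(self, record: bytes) -> bytes:
        # Request a short digest directly; BLAKE2b is an assumed choice.
        return hashlib.blake2b(record, digest_size=self.fingerprint_bytes).digest()

    def add(self, record: bytes):
        """Store the record unless it is already present.

        Returns (fingerprint, slot, is_new). The (fingerprint, slot) pair
        identifies the record even when fingerprints collide.
        """
        fp = self.fingerprint(record)
        bucket = self.buckets.setdefault(fp, [])
        # Resolve collisions by comparing full record contents.
        for slot, existing in enumerate(bucket):
            if existing == record:
                return fp, slot, False  # duplicate: nothing new is stored
        bucket.append(record)
        return fp, len(bucket) - 1, True

    def get(self, fp: bytes, slot: int) -> bytes:
        """Return the stored record for a (fingerprint, slot) reference."""
        return self.buckets[fp][slot]


if __name__ == "__main__":
    store = ShortFingerprintStore(fingerprint_bytes=1)
    first = store.add(b"uid=42")
    again = store.add(b"uid=42")
    print(first[2], again[2])  # first insert is new, second is a duplicate
```

With fingerprints this short, collisions are expected rather than rare, which is why the sketch keeps a per-fingerprint bucket and checks the full record before treating it as a duplicate.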
