AitBAD committed on
Commit d82cca2 · verified · 1 Parent(s): fdc84a4

Update README.md

  1. README.md +62 -0
@@ -12,4 +12,66 @@ license: apache-2.0
  short_description: This Space provides a web interface for Optical Character Re
  ---
 
+ # Asemmezdey Asekdan n Teqbaylit - Kabyle OCR
+
+ By Bouaziz Ait Driss
+
+ This Space provides a web interface for Optical Character Recognition (OCR) tailored to the Taqbaylit (Kabyle) language,
+ using a custom Tesseract model ('kab.traineddata') with support for
+ special characters (ɣ, ɛ, ḍ, ṭ, ḥ, ṛ, ṣ, ẓ, ǧ, č).
+
+ ## Features
+
+ * Upload PDF, PNG, JPG, or JPEG files.
+ * Perform OCR using the custom 'kab' model.
+ * Preview documents (PDFs only).
+ * Edit the extracted text.
+ * Download the final text as a UTF-8 encoded `.txt` file.
+ * Adjust the display DPI and font size for a better user experience.
+
+ ## How to Use
+
+ 1. Upload a file using the sidebar.
+ 2. If the file is a PDF, click "Sekker PDF (Askan n Yisebtar)" to load the page previews.
+ 3. Click "Sekker OCR" to start the OCR process.
+ 4. Edit the text in the right panel if needed.
+ 5. Download the final text using the "Zdem Aḍris" button.
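The OCR and download steps above can be sketched outside the web interface. This is a minimal sketch, not the Space's actual code: it assumes the `tesseract` command-line tool is installed and that `kab.traineddata` sits in a folder passed as `tessdata_dir`. The helper names `build_tesseract_command`, `run_ocr`, and `save_text` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, lang="kab", tessdata_dir=None):
    """Build the argument list for a Tesseract CLI call that prints text to stdout."""
    # "stdout" as the output base makes Tesseract print the text instead of writing a file.
    cmd = ["tesseract", str(image_path), "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Assumption: kab.traineddata lives in this folder.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(image_path, lang="kab", tessdata_dir=None):
    """Run Tesseract on one image and return the recognized text."""
    cmd = build_tesseract_command(image_path, lang, tessdata_dir)
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8")


def save_text(text, out_path):
    """Write the (possibly edited) text as a UTF-8 encoded .txt file."""
    Path(out_path).write_text(text, encoding="utf-8")
```

For a scanned page saved as `page.png`, `save_text(run_ocr("page.png", tessdata_dir="models"), "page.txt")` would produce a file comparable to the one offered by the "Zdem Aḍris" button; the file and folder names here are placeholders.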
+
+ ## Known Limitations
+
+ * Numbers: limited training data, so digits may be misrecognized.
+ * Legacy characters: some older, less-used characters are not recognized, such as "Г" (equivalent to "ɣ") and "ţ" (equivalent to "tt").
+ * Performance degrades with poor scan quality.
+ * Best results on printed text (not handwritten).
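The two character equivalences listed above could be applied to text by hand, for example when cleaning a reference transcription or a corrected OCR result. The following is only an illustrative sketch: `LEGACY_EQUIVALENTS` and `normalize_legacy` are hypothetical names, the mapping contains only the two pairs named in the README, and nothing here is taken from the Space's own code.

```python
# Hypothetical post-processing helper; not part of the Space's code.
# The two pairs below are the equivalences listed under Known Limitations.
LEGACY_EQUIVALENTS = {
    "Г": "ɣ",
    "ţ": "tt",
}


def normalize_legacy(text):
    """Replace each legacy character with its stated modern equivalent."""
    for old, new in LEGACY_EQUIVALENTS.items():
        text = text.replace(old, new)
    return text
```

Plain `str.replace` is used so that a one-to-two replacement such as "ţ" to "tt" needs no special handling; text without legacy characters passes through unchanged.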
+
+ ==============================================================================
+
+ Taqbaylit (Kabyle) version in English translation
+
+ This Space is a web-based tool for optical character recognition (OCR) of Taqbaylit text. It was made for the Taqbaylit (Kabyle) language.
+ It is built on a Tesseract model ('kab.traineddata') that covers the Taqbaylit / Tamaziɣt characters (ɣ, ɛ, ḍ, ṭ, ḥ, ṛ, ṣ, ẓ, ǧ, č).
+
+ Features
+
+ * Upload a PDF, PNG, JPG, or JPEG file.
+ * Run OCR using the 'kab' model.
+ * Load a PDF to display its pages.
+ * Edit the text and correct its spelling if needed.
+ * Download the text in UTF-8 format as a `.txt` file.
+ * Change the display DPI and the font size.
+
+ How it works
+
+ 1. Upload a file from the toolbar.
+ 2. Click "Sekker PDF (Askan n Yisebtar)" if the file is a PDF so that it is displayed.
+ 3. Click "Sekker OCR" to start character recognition (OCR).
+ 4. Edit the text shown in the right-hand panel if needed.
+ 5. Click "Zdem Aḍris" to save the file.
+
+ Shortcomings
+
+ * Numbers: the model's training on numbers is incomplete.
+ * Some old characters are not recognized, such as "Г", now written "ɣ", and "ţ", now written "tt".
+ * Accuracy drops as the scan quality of the images drops.
+ * Recognition output is good for printed text (not for handwriting).
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference