{"id":6473,"date":"2021-11-30T13:36:21","date_gmt":"2021-11-30T13:36:21","guid":{"rendered":"https:\/\/sadilar.org\/parallel-corpora-for-english-into-isixhosa\/"},"modified":"2021-11-30T13:36:21","modified_gmt":"2021-11-30T13:36:21","slug":"parallel-corpora-for-english-into-isixhosa","status":"publish","type":"post","link":"https:\/\/sadilar.org\/en\/parallel-corpora-for-english-into-isixhosa\/","title":{"rendered":"Parallel corpora for English-isiXhosa and English-Siswati"},"content":{"rendered":"<div class=\"googlefontscall\"><\/div>\n<div class=\"pagebuilderckparams\" data-colorpalettefromtemplate=\"\" data-colorpalettefromsettings=\",,,,\" data-styles=\"\"><\/div>\n<div class=\"rowck ckstack3 ckstack2 ckstack1 uick-sortable\" id=\"row_ID1638279331714\" data-gutter=\"2%\" data-nb=\"1\" style=\"position: relative;\">\n<style class=\"ckcolumnwidth\">[data-gutter=\"2%\"][data-nb=\"1\"]:not(.ckadvancedlayout) [data-width=\"100\"] {width:100%;}[data-gutter=\"2%\"][data-nb=\"1\"].ckadvancedlayout [data-width=\"100\"] {width:100%;}<\/style>\n<div class=\"inner animate clearfix\">\n<div class=\"blockck\" id=\"block_ID1638279331714\" data-real-width=\"100%\" data-width=\"100\" style=\"position: relative;\">\n<div class=\"ckstyle\"><\/div>\n<div class=\"inner animate resizable\">\n<div class=\"innercontent uick-sortable\">\n<div id=\"ID1638279331735\" class=\"cktype\" data-type=\"text\" style=\"position: relative;\">\n<div class=\"tab_effects ckprops\" fieldslist=\"\"><\/div>\n<div class=\"tab_blocstyles ckprops\" blocbackgroundpositionend=\"100\" blocbackgrounddirection=\"topbottom\" blocbackgroundimageattachment=\"scroll\" blocbackgroundimagerepeat=\"no-repeat\" blocbackgroundimagesize=\"auto\" blocbordertopstyle=\"solid\" blocborderrightstyle=\"solid\" blocborderbottomstyle=\"solid\" blocborderleftstyle=\"solid\" blocbordersstyle=\"solid\" blocshadowinset=\"0\" 
fieldslist=\"blocbackgroundpositionend,blocbackgrounddirection,blocbackgroundimageattachment,blocbackgroundimagerepeat,blocbackgroundimagesize,blocalignementleft,blocalignementcenter,blocalignementright,blocalignementjustify,blocbordertopstyle,blocborderrightstyle,blocborderbottomstyle,blocborderleftstyle,blocbordersstyle,blocshadowinset\"><\/div>\n<div class=\"tab_edition ckprops\" fieldslist=\"\"><\/div>\n<div class=\"ckstyle\">\n<style><\/style>\n<\/div>\n<div class=\"cktext inner\" style=\"position: relative;\" spellcheck=\"false\">\n<p><strong>Project Type: <\/strong>Node<br \/><strong>Project Start Date: <\/strong>1 July 2019<br \/><strong>Project Status: <\/strong>Completed and delivered<\/p>\n<p><span style=\"text-decoration: underline;\" data-mce-style=\"text-decoration: underline;\"><strong>English-Siswati corpus<\/strong><\/span><\/p>\n<p><strong>Project Aims:<\/strong><\/p>\n<p>This project entailed the collection and processing of bilingual data to develop a 2-million-word English\u2013Siswati parallel-aligned corpus that can be used to train machine translation systems. 
The data was acquired by crawling various South African web domains and through human translation, each source accounting for roughly 50% of the final corpus.&nbsp;<\/p>\n<p>A 1,5-million-word monolingual corpus for Siswati was also created and packaged with the parallel corpus as an additional deliverable.<\/p>\n<p><strong>Project Deliverables:<\/strong><\/p>\n<ul>\n<li>2 million words <a href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/560\" data-mce-href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/560\">parallel corpus English-Siswati<\/a><\/li>\n<li>1,5 million words <a href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/559\" data-mce-href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/559\">monolingual corpus Siswati<\/a><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"text-decoration: underline;\" data-mce-style=\"text-decoration: underline;\"><strong>English-isiXhosa corpus<\/strong><\/span><\/p>\n<p><strong>Project Aims:<\/strong><\/p>\n<p class=\"paragraph\" style=\"vertical-align: baseline;\" data-mce-style=\"vertical-align: baseline;\"><span class=\"normaltextrun\"><span lang=\"EN-ZA\" style=\"font-size: 10.0pt; font-family: 'Arial',sans-serif;\" data-mce-style=\"font-size: 10.0pt; font-family: 'Arial',sans-serif;\">In this project, a 1,85-million-word parallel corpus for English-isiXhosa was developed. The bulk of the data (80%) was collected from various South African (mainly government) web domains. The remainder consists of data sourced for the DSAC-funded Autshumato project that was not previously released. 
The corpus is aligned at sentence level and can be used for machine translation system development.<\/span><\/span><\/p>\n<p class=\"paragraph\" style=\"vertical-align: baseline;\" data-mce-style=\"vertical-align: baseline;\"><span class=\"normaltextrun\"><span lang=\"EN-ZA\" style=\"font-size: 10.0pt; font-family: 'Arial',sans-serif;\" data-mce-style=\"font-size: 10.0pt; font-family: 'Arial',sans-serif;\">In addition, a 2,5-million-word monolingual isiXhosa corpus has been made available.<\/span><\/span><\/p>\n<p class=\"paragraph\" style=\"vertical-align: baseline;\" data-mce-style=\"vertical-align: baseline;\"><span class=\"normaltextrun\"><span lang=\"EN-ZA\" style=\"font-size: 10.0pt; font-family: 'Arial',sans-serif;\" data-mce-style=\"font-size: 10.0pt; font-family: 'Arial',sans-serif;\"><strong>Project Deliverables:<\/strong><\/span><\/span><\/p>\n<ul>\n<li>1,85 million words <a href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/525\" data-mce-href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/525\">parallel corpus for English-isiXhosa<\/a><\/li>\n<li>2,5 million words <a href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/524\" data-mce-href=\"https:\/\/repo.sadilar.org\/handle\/20.500.12185\/524\">monolingual corpus isiXhosa<\/a><\/li>\n<\/ul>\n<p class=\"paragraph\" style=\"vertical-align: baseline;\" data-mce-style=\"vertical-align: baseline;\">&nbsp;<strong>Contact details:<\/strong><\/p>\n<p>Please contact&nbsp;<a href=\"mailto:ctext@nwu.ac.za\" data-mce-href=\"mailto:ctext@nwu.ac.za\">ctext@nwu.ac.za<\/a>&nbsp;<\/p>\n<\/div><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"ckstyle\"><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Project Type: Node Project Start Date: 1 July 2019 Project Status: Completed and delivered English-Siswati corpus Project Aims: This project entailed the collection and processing of bilingual data to develop a 2-million-word English\u2013Siswati parallel-aligned corpus that can be 
used to train machine translation systems. The data was acquired by crawling various South African web domains and human [&hellip;]<\/p>\n","protected":false},"author":246,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[730],"tags":[],"class_list":["post-6473","post","type-post","status-publish","format-standard","hentry","category-general"],"acf":[],"_links":{"self":[{"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/posts\/6473","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/users\/246"}],"replies":[{"embeddable":true,"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/comments?post=6473"}],"version-history":[{"count":0,"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/posts\/6473\/revisions"}],"wp:attachment":[{"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/media?parent=6473"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/categories?post=6473"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sadilar.org\/en\/wp-json\/wp\/v2\/tags?post=6473"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}