


Xinzhiyuan (新智元) Report

Editors: 好困, 庸庸
[Xinzhiyuan Overview] "Fine-tune your model and get better performance than GPT-4" is not just talk; it is genuinely doable. Recently, a hands-on ML engineer tuned several open-source LLMs into exactly the models he wanted.

Fine-tuning means taking an already-trained large language model (LLM) and training it further on a specific dataset. This "standing on the shoulders of giants" approach can deliver better returns from a relatively small dataset at a relatively low training cost, and avoids reinventing the wheel.
In the era of large models, prompt engineering, fine-tuning, and retrieval-augmented generation (RAG) are all essential skills. For most individual users, prompt engineering is enough; but if you want to wire a large model into your own service, fine-tuning is the way to go, and because it demands more skill, it has become the arena where ML engineers compete.
Recently, one ML engineer's fine-tuned models hit the front page of Hacker News: "My finetuned models beat OpenAI's GPT-4!"
Engineer Alex Strick van Linschoten reports in his blog post that, on his test data, fine-tuned Mistral, Llama3, and Solar LLM models are all more accurate than OpenAI's models.

Article: https://mlops.systems/posts/2024-07-01-full-finetuned-model-evaluation.html#start-date-accuracy
The author jokes that the post's title might as well be: fine-tuned models beat OpenAI, but evaluation was painful.
He admits the code is extensive and runs slowly. This was also the first time he found himself weighing trade-offs between fine-tuning options; without some kind of system to handle that, the complexity of maintaining everything starts to grow.
Now let's look at how the engineer did it. (See the original post for the full code.)

Loading the dataset

All of the data is stored in a public repository on the Hugging Face Hub.
For these evaluations, the author uses the test split of the dataset, since the models have never seen this data before, which gives a better sense of how they perform on new inputs.
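For context, loading that split is a one-liner; here is a minimal sketch (the repo id is an assumption based on the prediction datasets named later in the post):
from datasets import load_dataset

# Hypothetical repo id; the post later loads "strickvl/isafpressreleases_with_preds",
# so the raw data presumably lives under a similar name.
test_dataset = load_dataset("strickvl/isafpressreleases")["test"]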
test_dataset
Dataset({
    features: ['name', 'eventrefnumber', 'text', 'StartDate', 'eventtype', 'province', 'citydistrict', 'village', 'targetgroup', 'commander', 'position', 'minkilled', 'mincaptured', 'capturedcharacterisation', 'killedcharacterisation', 'killq', 'captureq', 'killcaptureraid', 'airstrike', 'noshotsfired', 'dataprocessed', 'flagged', 'glossarymeta', 'minleaderskilled', 'minfacilitatorskilled', 'minleaderscaptured', 'minfacilitatorscaptured', 'leaderq'],
    num_rows: 724
})
First, an extra column is added to the DataFrame, and a prediction is made for each row of the dataset. A copy of each prediction is stored in that column so this compute-intensive step doesn't have to be repeated.
First, though, the data needs to be assembled into Pydantic objects so that data validation is handled.
[
    IsafEvent(
        name='5',
        text='2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
        start_date=datetime.date(2013, 1, 24),
        event_type={'insurgentskilled'},
        province={'ghazni'},
        target_group={'taliban'},
        min_killed=1,
        min_captured=0,
        killq=True,
        captureq=False,
        killcaptureraid=False,
        airstrike=False,
        noshotsfired=False,
        min_leaders_killed=1,
        min_leaders_captured=0,
        predictions={}
    ),
    IsafEvent(
        name='2',
        text='2011-11-S-034\nISAF Joint Command - Afghanistan\nFor Immediate Release\n\nKABUL, Afghanistan (Nov. 20, 2011)\nA coalition security force detained numerous suspected insurgents during an operation in Marjeh district, Helmand province, yesterday.  The force conducted the operation after receiving information that a group of insurgents were at a compound in the area.  After calling for the men inside to come out peacefully, the insurgents emerged and were detained without incident.',
        start_date=datetime.date(2011, 11, 19),
        event_type={'detention'},
        province={'helmand'},
        target_group={''},
        min_killed=0,
        min_captured=4,
        killq=False,
        captureq=True,
        killcaptureraid=True,
        airstrike=False,
        noshotsfired=False,
        min_leaders_killed=0,
        min_leaders_captured=0,
        predictions={}
    )
]
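The post doesn't reproduce the class definition, but judging by the fields in the dump above, a minimal IsafEvent sketch might look like this (field types inferred, not the author's actual code):
from datetime import date
from typing import Dict, Set
from pydantic import BaseModel

class IsafEvent(BaseModel):
    # Field names and types inferred from the printed objects above;
    # the author's real class may add validators and descriptions.
    name: str
    text: str
    start_date: date
    event_type: Set[str]
    province: Set[str]
    target_group: Set[str]
    min_killed: int
    min_captured: int
    killq: bool
    captureq: bool
    killcaptureraid: bool
    airstrike: bool
    noshotsfired: bool
    min_leaders_killed: int
    min_leaders_captured: int
    predictions: Dict[str, str] = {}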
So when we make predictions, we want the model to return a JSON string that looks like this:
json_str = events[0].model_dump_json(exclude={"text", "predictions"})
print(json_str)
{"name":"5","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}
The full evaluation starts with the GPT models, which need a more elaborate prompt to produce the desired results.
Since the GPT models were never trained or fine-tuned to respond to the fine-tuned models' specific prompt, we can't simply reuse the same prompt.
This raises an interesting question: how much effort should go into engineering the GPT prompt before it matches the fine-tuned models' accuracy? Put differently, is there really a fair way to compare models that take different prompts?
Trying OpenAI's GPT-4 and GPT-4 Turbo shows just how long the prompt needs to be for the GPT models to stand a chance against the fine-tuned ones.
Ideally the author would pack more examples into the context, but he also didn't want to drive up token usage.
from openai import OpenAI
from rich import print
import json
import os

def query_openai(article_text: str, model: str) -> str:
    query = (
        f"The following is a press release issued by ISAF (formerly operating in Afghanistan):\n{article_text}\n\n"
        "## Extraction request\n"
        "Please extract the following information from the press release:\n"
        "- The name of the event (summarising the event / text as a headline)\n"
        "- The start date of the event\n"
        "- The event type(s)\n"
        "- The province(s) in which the event occurred\n"
        "- The target group(s) of the event\n"
        "- The minimum number of people killed during the event\n"
        "- The minimum number of people captured during the event\n"
        "- Whether someone was killed or not during the event\n"
        "- Whether someone was captured or not during the event\n"
        "- Whether the event was a so-called 'kill-capture raid'\n"
        "- Whether an airstrike was used during the event\n"
        "- Whether no shots were fired during the event\n"
        "- The minimum number of leaders killed during the event\n"
        "- The minimum number of leaders captured during the event\n\n"
        "## Annotation notes:\n"
        "- A 'faciliator' is not a leader.\n"
        "- If a press release states that 'insurgents' were detained without further "
        "details, assign a minimum number of two detained. Interpret 'a couple' as "
        "two. Interpret 'several' as at least three, even though it may sometimes "
        "refer to seven or eight. Classify the terms 'a few', 'some', 'a group', 'a "
        "small group', and 'multiple' as denoting at least three, even if they "
        "sometimes refer to larger numbers. Choose the smaller number if no other "
        "information is available in the press release to come up with a minimally "
        "acceptable figure. Interpret 'numerous' and 'a handful' as at least four, "
        "and 'a large number' as at least five.\n\n"
        "## Example:\n"
        "Article text: 'ISAF Joint Command Evening Operational Update Feb. 19, 2011\nISAF Joint Command - "
        "Afghanistan\u20282011-02-S-143\u2028For Immediate Release \u2028\u2028KABUL, Afghanistan (Feb. 19)\u2028\u2028ISAF "
        "service members at a compound in Sangin district, Helmand province observed numerous insurgents north and south of "
        "their position talking on radios today. After gaining positive identification of the insurgent positions, the "
        "coalition troops engaged, killing several insurgents. Later, the ISAF troops observed more insurgents positioning "
        "in the area with weapons. After positive identification, coalition forces continued firing on the various insurgent "
        "positions, resulting in several more insurgents being killed.'\n\n"
        'Output: `{"name":"Several insurgents killed in '
        'Helmand","start_date":"2011-02-18","event_type":["insurgentskilled"],"province":["helmand"],"target_group":[""],"mi'
        'n_killed":6,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired"'
        ':false,"min_leaders_killed":0,"min_leaders_captured":0}`'
    )

    # set up the prediction harness
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "You are an expert at identifying events in a press release. You are precise "
                "and always make sure you are correct, drawing inference from the text of the "
                "press release.\n\n You always return a JSON string with the following schema: "
                "## JSON Schema details\n"
                "Here is some of the schema for the JSON output string you "
                "should make use of: event_types = ['airstrike', 'detention', "
                "'captureandkill', 'insurgentskilled', 'exchangeoffire', 'civiliancasualty'], "
                "provinces = ['badakhshan', 'badghis', 'baghlan', 'balkh', 'bamyan', "
                "'day_kundi', 'farah', 'faryab', 'ghazni', 'ghor', 'helmand', 'herat', "
                "'jowzjan', 'kabul', 'kandahar', 'kapisa', 'khost', 'kunar', 'kunduz', "
                "'laghman', 'logar', 'nangarhar', 'nimroz', 'nuristan', 'paktya', 'paktika', "
                "'panjshir', 'parwan', 'samangan', 'sar_e_pul', 'takhar', 'uruzgan', "
                "'wardak', 'zabul'], target_groups = ['taliban', 'haqqani', 'criminals', "
                "'aq', 'hig', 'let', 'imu', 'judq', 'iju', 'hik', 'ttp', 'other']\n\n",
            },
            {"role": "user", "content": query},
        ],
        temperature=1,
    )

    return response.choices[0].message.content
A quick example confirms the function works:
json_str = query_openai(events[0].text, "gpt-4o")
print(json.loads(json_str))
{
    'name': 'Taliban leader Alaudin killed in Ghazni',
    'start_date': '2013-01-24',
    'event_type': ['insurgentskilled'],
    'province': ['ghazni'],
    'target_group': ['taliban'],
    'min_killed': 1,
    'min_captured': 0,
    'killq': True,
    'captureq': False,
    'killcaptureraid': True,
    'airstrike': False,
    'noshotsfired': False,
    'min_leaders_killed': 1,
    'min_leaders_captured': 0
}
The model works (as expected), and we get a JSON string back.
Next, build a harness that loops over all the test data, gets a prediction for each row, and stores the predictions on the Pydantic objects.
Batch prediction should be done asynchronously: there are a lot of events, and we don't want it to take too long.
The author also added some retry logic to the function to cope with rate limits on the GPT-3.5-turbo model.
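The harness itself isn't shown here; a minimal sketch of asynchronous batching with naive exponential-backoff retries might look like the following, where query_openai_async stands in for a hypothetical async variant of the function above:
import asyncio

async def predict_with_retry(event, model: str, max_retries: int = 5) -> str:
    # Back off exponentially on failures such as rate-limit errors.
    for attempt in range(max_retries):
        try:
            return await query_openai_async(event.text, model)
        except Exception:
            await asyncio.sleep(2 ** attempt)
    return ""

async def predict_all(events, model: str) -> None:
    # Fire off all requests concurrently and attach results to each event.
    preds = await asyncio.gather(*(predict_with_retry(e, model) for e in events))
    for event, pred in zip(events, preds):
        event.predictions[model] = pred

# e.g. asyncio.run(predict_all(events, "gpt-4o"))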
As we can now see, each event has three predictions attached to it.
print(events[0])
IsafEvent(
    name='5',
    text='2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
    start_date=datetime.date(2013, 1, 24),
    event_type={'insurgentskilled'},
    province={'ghazni'},
    target_group={'taliban'},
    min_killed=1,
    min_captured=0,
    killq=True,
    captureq=False,
    killcaptureraid=False,
    airstrike=False,
    noshotsfired=False,
    min_leaders_killed=1,
    min_leaders_captured=0,
    predictions={
        'gpt-4o': '{\n  "name": "Taliban leader Alaudin killed in Ghazni",\n  "start_date": "2013-01-24",\n  "event_type": ["insurgentskilled", "captureandkill"],\n  "province": ["ghazni"],\n  "target_group": ["taliban"],\n "min_killed": 1,\n  "min_captured": 0,\n  "killq": true,\n  "captureq": false,\n  "killcaptureraid": true,\n  "airstrike": false,\n  "noshotsfired": false,\n  "min_leaders_killed": 1,\n  "min_leaders_captured": 0\n}',
        'gpt-4-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": true,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-3.5-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni province",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": false,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}'
    }
)
With all the predictions held in memory, it's time to commit them to a dataset and push it to the Hugging Face Hub, in case the notebook crashes, the local machine shuts down, or anything else goes wrong.
The author wrote a function to handle this, since the step has to be repeated for the other models. It's a little verbose, but that way every step is easy to see.
A more concise, abstracted convert_to_dataset function might look like this:
from datetime import date
from typing import List

from datasets import Dataset
from pydantic import BaseModel


def convert_to_dataset(data: List[BaseModel]) -> Dataset:
    dataset_dict = {}

    # Note: __fields__ / outer_type_ are the Pydantic v1 introspection APIs.
    for field_name, field_value in data[0].__fields__.items():
        field_type = field_value.outer_type_
        if field_type in [str, int, float, bool, date]:
            dataset_dict[field_name] = [getattr(item, field_name) for item in data]
        elif field_type == set:
            dataset_dict[field_name] = [list(getattr(item, field_name)) for item in data]
        elif issubclass(field_type, BaseModel):
            dataset_dict[field_name] = [getattr(item, field_name).dict() for item in data]
        else:
            dataset_dict[field_name] = [getattr(item, field_name) for item in data]

    dataset = Dataset.from_dict(dataset_dict)
    return dataset
For now, though, the data just needs to be pushed to the Hub.
convert_and_push_dataset(events, "isafpressreleases_with_preds", split_name="test")
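convert_and_push_dataset itself isn't reproduced in the post; given convert_to_dataset above, it plausibly reduces to something like this sketch (the "strickvl/" namespace and the split keyword are assumptions):
def convert_and_push_dataset(events, dataset_name: str, split_name: str) -> None:
    # Convert the Pydantic objects into a Dataset and push that split to the Hub.
    dataset = convert_to_dataset(events)
    dataset.push_to_hub(f"strickvl/{dataset_name}", split=split_name)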

Adding predictions from the fine-tuned models


With the baseline OpenAI models done, it's time to add some of the previously fine-tuned models, both local ones and models hosted by one-click fine-tuning providers.

Reloading the prediction dataset

First, load the dataset; predictions from some local models get added next:
from datasets import load_dataset
preds_test_data = load_dataset("strickvl/isafpressreleases_with_preds")["test"].to_list()

Predictions from fine-tuned TinyLlama

Inspecting the dataset now shows that the new model predictions have been saved:
from rich import print
print(preds_test_data[0])
{
    'name': '5',
    'text': '2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
    'predictions': {
        'gpt-3.5-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni province",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": false,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": true,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4o': '{\n  "name": "Taliban leader Alaudin killed in Ghazni",\n  "start_date": "2013-01-24",\n  "event_type": ["insurgentskilled", "captureandkill"],\n  "province": ["ghazni"],\n  "target_group": ["taliban"],\n "min_killed": 1,\n  "min_captured": 0,\n  "killq": true,\n  "captureq": false,\n  "killcaptureraid": true,\n  "airstrike": false,\n  "noshotsfired": false,\n  "min_leaders_killed": 1,\n  "min_leaders_captured": 0\n}',
        'tinyllama-templatefree': '\n{"name":"Taliban leader killed in Ghazni","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'tinyllama-sharegpt': '{"name":"2","start_date":"2013-01-24","event_type":["airstrike"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":true,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}'
    },
    'start_date': datetime.date(2013, 1, 24),
    'province': ['ghazni'],
    'target_group': ['taliban'],
    'event_type': ['insurgentskilled'],
    'min_killed': 1,
    'min_captured': 0,
    'killq': True,
    'captureq': False,
    'killcaptureraid': False,
    'airstrike': False,
    'noshotsfired': False,
    'min_leaders_killed': 1,
    'min_leaders_captured': 0
}

Predictions from fine-tuned Mistral

As the author mentioned earlier, the fine-tuned Mistral model couldn't be run locally, so he ran inference on Modal, where a powerful A100 GPU was available for predictions.
As we'll see, this model did not do well; it failed almost every evaluation. In the charts, the model sitting at zero is mistral-lora-templatefree.

Predictions from fine-tuned OpenAI

The gpt-3.5-turbo-1106 model was fine-tuned with OpenAI's one-click fine-tuning service, and its predictions were generated by looping over the dataset with the OpenAI SDK.

Fine-tuned Mistral models (via OpenPipe)

Mistral 7B and Mixtral 8x7B models were fine-tuned via OpenPipe to give a reasonable baseline against which to compare the other models.

Fine-tuned Solar LLM (via Predibase)

About a week earlier, Predibase had announced a new top-tier fine-tunable model, Upstage's Solar LLM, so the author decided to try it.
The model's selling point is that it was trained to be very good at the tasks people most often fine-tune for, such as structured data extraction. As the charts show, it performed quite well!

Predictions from fine-tuned Llama3 (via OpenPipe)

The author's locally fine-tuned Llama3 model didn't perform well, but the outputs from the OpenPipe-hosted version looked fine, so those predictions were used for the final evaluation.
from rich import print
print(preds_test_data[0])
{
    'name': '5',
    'text': '2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
    'predictions': {
        'finetuned-llama3-7b-32k-openpipe': '{"name":"1","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'finetuned-mistral-7b-optimised-openpipe': '{"name":"1","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'finetuned-openai-gpt-3.5-turbo-1106': '{"name":"4","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'gpt-3.5-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni province",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": false,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": true,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4o': '{\n  "name": "Taliban leader Alaudin killed in Ghazni",\n  "start_date": "2013-01-24",\n  "event_type": ["insurgentskilled", "captureandkill"],\n  "province": ["ghazni"],\n  "target_group": ["taliban"],\n "min_killed": 1,\n  "min_captured": 0,\n  "killq": true,\n  "captureq": false,\n  "killcaptureraid": true,\n  "airstrike": false,\n  "noshotsfired": false,\n  "min_leaders_killed": 1,\n  "min_leaders_captured": 0\n}',
        'mistral-lora-templatefree': '1',
        'tinyllama-sharegpt': '{"name":"2","start_date":"2013-01-24","event_type":["airstrike"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":true,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'tinyllama-templatefree': '\n{"name":"Taliban leader killed in Ghazni","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'ft-solar-1-mini-chat-240612-predibase': '\n\n{"name":"2","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}'
    },
    'start_date': datetime.date(2013, 1, 24),
    'province': ['ghazni'],
    'target_group': ['taliban'],
    'event_type': ['insurgentskilled'],
    'min_killed': 1,
    'min_captured': 0,
    'killq': True,
    'captureq': False,
    'killcaptureraid': False,
    'airstrike': False,
    'noshotsfired': False,
    'min_leaders_killed': 1,
    'min_leaders_captured': 0
}
We now have predictions from seven fine-tuned models, plus three OpenAI models for comparison, so the evaluation can begin.
Before that, though, one quick check: how many of these predictions are even valid JSON?

JSON validity test

from datasets import load_dataset
dataset_with_preds = load_dataset("strickvl/isafpressreleases_test_predictions")["train"].to_list()
Comparing how reliably the templatefree and sharegpt variants produce valid JSON already reveals the difference the two formats made in the TinyLlama fine-tunes, which is instructive in itself.
The OpenAI models produce valid JSON every time, as do the fine-tuned Mistral and Llama3 models.
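The validity check itself can be as simple as trying to parse each prediction and counting only JSON objects; a sketch of what it might reduce to:
import json

valid_json_counts = {}
for row in dataset_with_preds:
    for model, pred in row["predictions"].items():
        try:
            # Count only predictions that parse to a JSON object.
            if isinstance(json.loads(pred), dict):
                valid_json_counts[model] = valid_json_counts.get(model, 0) + 1
        except (TypeError, ValueError):
            pass  # None, empty strings, and malformed output all land here
print(valid_json_counts)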
While writing the evaluation code, the author noticed that some entries were blank or had no prediction at all, so that was the next thing he investigated.
# find out how many of the predictions are None values or empty strings
missing_values = {
    "gpt-4o": 0,
    "gpt-4-turbo": 0,
    "gpt-3.5-turbo": 0,
    "tinyllama-templatefree": 0,
    "tinyllama-sharegpt": 0,
    "finetuned-openai-gpt-3.5-turbo-1106": 0,
    "finetuned-llama3-7b-32k-openpipe": 0,
    "mistral-lora-templatefree": 0,
    "finetuned-mistral-7b-optimised-openpipe": 0,
    "ft-solar-1-mini-chat-240612-predibase": 0,
}

for row in dataset_with_preds:
    for model in row["predictions"]:
        if row["predictions"][model] is None or row["predictions"][model] == "":
            missing_values[model] += 1

print(missing_values)
{
    'gpt-4o': 0,
    'gpt-4-turbo': 0,
    'gpt-3.5-turbo': 0,
    'tinyllama-templatefree': 0,
    'tinyllama-sharegpt': 38,
    'finetuned-openai-gpt-3.5-turbo-1106': 0,
    'finetuned-llama3-7b-32k-openpipe': 0,
    'mistral-lora-templatefree': 0,
    'finetuned-mistral-7b-optimised-openpipe': 0,
    'ft-solar-1-mini-chat-240612-predibase': 0
}
Without those missing values, the tinyllama-sharegpt model would have all 724 predictions, every one of them valid JSON.
Now we come to the part we actually care about: accuracy. The author computes scores for every meaningful attribute and then presents the comparison across models; a sketch of what such per-attribute scoring might look like follows the list below.
The attributes are:
  • start_date

  • province

  • target_group

  • event_type

  • min_killed

  • min_captured

  • killq

  • captureq

  • killcaptureraid

  • airstrike

  • noshotsfired

  • min_leaders_killed

  • min_leaders_captured
An important note for all the charts that follow: the total number of tasks is 724, so every count is out of 724.
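As flagged above, here is a sketch of what per-attribute scoring might boil down to, using start_date as the example (the author's actual evaluation code is longer; this is illustrative only):
import json
from datetime import date

def start_date_correct(row: dict, model: str) -> bool:
    # A prediction only scores if it parses as JSON and matches the gold date.
    try:
        pred = json.loads(row["predictions"][model])
        return date.fromisoformat(pred["start_date"]) == row["start_date"]
    except (TypeError, KeyError, ValueError):
        return False

start_date_scores = {
    model: sum(start_date_correct(row, model) for row in dataset_with_preds)
    for model in dataset_with_preds[0]["predictions"]
}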

Test results

Start-date accuracy

{
    'gpt-4o': 527,
    'gpt-4-turbo': 522,
    'gpt-3.5-turbo': 492,
    'tinyllama-templatefree': 231,
    'tinyllama-sharegpt': 479,
    'finetuned-openai-gpt-3.5-turbo-1106': 646,
    'finetuned-llama3-7b-32k-openpipe': 585,
    'mistral-lora-templatefree': 0,
    'finetuned-mistral-7b-optimised-openpipe': 636,
    'ft-solar-1-mini-chat-240612-predibase': 649
}
Solar and the fine-tuned GPT-3.5 model are best at predicting the date an event took place.

Province accuracy

{
    'gpt-4o': 649,
    'gpt-4-turbo': 645,
    'gpt-3.5-turbo': 595,
    'tinyllama-templatefree': 335,
    'tinyllama-sharegpt': 660,
    'finetuned-openai-gpt-3.5-turbo-1106': 704,
    'finetuned-llama3-7b-32k-openpipe': 707,
    'mistral-lora-templatefree': 0,
    'finetuned-mistral-7b-optimised-openpipe': 711,
    'ft-solar-1-mini-chat-240612-predibase': 704
}
The analysis shows that the fine-tuned models actually do better than the OpenAI models here, making only a handful of mistakes.

Target-group accuracy

Several target groups can be mentioned in a single release, so the author assigns a score out of a maximum of 1 based on how many of the groups the model predicted are correct.
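The exact formula isn't shown; one plausible reading of "how many of the predicted groups are correct" is a precision-style fraction like this (illustrative only):
def target_group_score(predicted: set, gold: set) -> float:
    # Fraction of the model's predicted groups that appear in the gold labels.
    if not predicted:
        return 1.0 if not gold else 0.0
    return len(predicted & gold) / len(predicted)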
The fine-tuned models are clearly better than OpenAI's at identifying target groups.
The author suspects, though, that performance would drop if new groups that never appeared in the training data were introduced.

Event-type accuracy

Event type is actually one of the hardest categories: some of the classes overlap semantically, and at times even human annotators struggle to tell them apart.
Once again, the fine-tuned models do quite well here.

min_killed accuracy

On these numeric-estimation tasks, the gap between the fine-tuned and OpenAI models suddenly narrows.
Mistral still comes out on top, but not by much! And the OpenAI models are impressively strong here.
The author suspects this is because the prompt includes an entire paragraph explaining the criteria used when annotating the examples:
Annotation notes: A 'facilitator' is not a leader. If a press release states that 'insurgents' were detained without further details, assign a minimum number of two detained. Interpret 'a couple' as two. Interpret 'several' as at least three, even though it may sometimes refer to seven or eight. Classify the terms 'a few', 'some', 'a group', 'a small group', and 'multiple' as denoting at least three, even if they sometimes refer to larger numbers. Choose the smaller number if no other information is available in the press release to come up with a minimally acceptable figure. Interpret 'numerous' and 'a handful' as at least four, and 'a large number' as at least five.
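Those rules effectively define a small lookup table from vague quantifiers to minimum counts. Spelled out as data (purely illustrative, not the author's code), it would look like this:
# Minimum counts implied by the annotation notes quoted above.
QUANTIFIER_MINIMUMS = {
    "a couple": 2,
    "insurgents (no further detail)": 2,
    "several": 3,
    "a few": 3,
    "some": 3,
    "a group": 3,
    "a small group": 3,
    "multiple": 3,
    "numerous": 4,
    "a handful": 4,
    "a large number": 5,
}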

min_captured accuracy

killq accuracy

The author expected accuracy on these boolean attributes to be very high, and essentially every model delivers.
Even so, fine-tuned Mistral still beats GPT-4o's best score.

captureq accuracy

killcaptureraid accuracy

"Kill-capture raid" is a term of art that was used in a specific way during annotation.
OpenAI's models know nothing about how the author makes those judgment calls, which explains their weak showing here.

airstrike accuracy

noshotsfired accuracy

The noshotsfired attribute marks whether the press release mentions that no shots were fired during a raid or incident. (For a stretch of time, press releases were especially fond of pointing this out.)
The author isn't quite sure why the OpenAI models perform contrary to expectations here.
One semi-anthropomorphic explanation that comes to mind is that GPT-4-class models may be "overthinking" this label, but it would take more investigation to really understand.

min_leaders_killed accuracy

We often hear that LLMs are bad with numbers, or that they fall back on default values, and so on. Yet on this task every model scores remarkably high.
Presumably this is something people have been working hard to improve, with clear results. The fine-tuned models still do best, though.

min_leaders_captured accuracy

Final scores


Adding up these per-attribute scores and taking the average gives a final accuracy score for each model.
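Mechanically, this is just a mean over the per-attribute counts shown in the dicts above; a sketch, where per_attribute_scores is a hypothetical nesting of those result dicts:
# per_attribute_scores: {model_name: {attribute_name: correct_count_out_of_724}}
final_scores = {
    model: sum(scores.values()) / len(scores)
    for model, scores in per_attribute_scores.items()
}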
Even the author was surprised that the fine-tuned models beat OpenAI's GPT-class models. Even TinyLlama outperformed GPT-3.5 Turbo!
The best performer was Mistral-7B (fine-tuned on OpenPipe), followed closely by Solar LLM and Llama3-7B.
Going by the scores alone, anyone looking to fine-tune a model for structured data extraction could start with Mistral-7B, Solar 7B, or Llama3-7B. They are likely to be roughly equally accurate, with different trade-offs around serving, efficiency, and latency.
The author reckons that a few more in-context examples (plus more explanations and rules) could push the OpenAI models to do better, but owning your own fine-tuned model brings other benefits:
  • Data privacy (your information is never sent to OpenAI)
  • A smaller model quite possibly means better performance (though this still needs to be tested and proven)
  • More control overall
  • Cost improvements
On cost, it's hard to make comparisons or claims right now, especially given the economies of scale the big cloud providers can rely on.
But in a real use case that requires repeated inference against this model over the long term, the cost argument is much more likely to hold up.
Especially because the only way to make the OpenAI inference calls better is to stuff in lots of examples and extra explanation, which significantly raises the cost of every query.
That said, fine-tuning a model does come with some genuine trade-offs.

Fine-tuning works, but...

First, the author was delighted to find that the oft-repeated claim "fine-tune your model and get better performance than GPT-4" is actually true! Not only true, but true with relatively little tweaking and adaptation.
Keep in mind that every model above is a first fine-tune on the existing data, run with essentially all-default settings; they worked out of the box.
As a next step, the author plans to focus on the best performers: the Solar, Llama3, and Mistral 7B models.

Evaluation was painful

The author has some models that run locally, plus others deployed across different environments and services.
Having models scattered across different places is a real hassle. Ideally you'd want a single standard interface for inference across all your models, especially when they serve the same use case or project. Conveniently, the author's fine-tuned GPT-3.5 is deployed and served automatically by OpenAI, as are Llama3, Solar, and Mistral, but he'd like one place where he can see them all.
Once you have several models running, you're fine-tuning and updating them, and the data keeps changing, you need some way to manage it all.
That is one of the main challenges of fine-tuning LLMs: you have to manage all of these moving parts so they work reliably and reproducibly. Even early in a project, you need a way to keep everything organized and error-free.

But now you can tell whether you're making progress

Painful as implementing the evaluations was, they give the author one crucial tool: a task-specific way to judge whether any improvement or optimization to the training data or the model is actually moving things forward. Without it, you're essentially fumbling in the dark.

What's next


The author originally wanted to train several models, each a super-specialist in its own niche, for example a model that is extremely good at estimating how many people were captured in a given event.
Seeing how the models performed, though, he's no longer sure that should be the next step, or that the approach would actually yield a significant accuracy gain.
Next up are some evaluations unrelated to accuracy, for example seeing how the models behave on out-of-domain data (that is, entirely made-up data about completely different things).
Another thread is digging into the details of model serving: taking the top three performers and studying how LLM serving is actually done.
But before moving on, he's waiting for the excitement to wear off a little: "For now, I'm just happy my finetuned models beat GPT-4!"
References:
https://mlops.systems/posts/2024-07-01-full-finetuned-model-evaluation.html#start-date-accuracy