How do I read a text file into MATLAB and turn it into a list? - MATLAB Enthusiasts Forum - LabFans.com

poster · 2019-12-10, 20:41

I have a text file in the following format:

     gene            complement(22995..24539)
                     /gene="ppp"
                     /locus_tag="MRA_0020"
     CDS             complement(22995..24539)
                     /gene="ppp"
                     /locus_tag="MRA_0020"
                     /codon_start=1
                     /transl_table=11
                     /product="putative serine/threonine phosphatase Ppp"
                     /protein_id="ABQ71738.1"
                     /db_xref="GI:148503929"
     gene            complement(24628..25095)
                     /locus_tag="MRA_0021"
     CDS             complement(24628..25095)
                     /locus_tag="MRA_0021"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="ABQ71739.1"
                     /db_xref="GI:148503930"
     gene            complement(25219..26802)
                     /locus_tag="MRA_0022"
     CDS             complement(25219..26802)
                     /locus_tag="MRA_0022"
                     /codon_start=1
                     /transl_table=11
                     /product="hypothetical protein"
                     /protein_id="ABQ71740.1"
                     /db_xref="GI:148503931"

I want to read the text file into MATLAB and build a list in which each item starts at a gene line. In this example, the list would therefore contain 3 items. I have tried a few approaches but could not get any of them to work. Does anyone have an idea of what I could do?

Answer:

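Here is a quick suggestion for an algorithm:

1. Open the file with fopen.
2. Read lines with fgetl until you find a line that starts with 'CDS'.
3. Keep reading lines until you reach another line that starts with 'gene'.
4. For every line between the lines found in (2) and (3):
   - Find the string between '/' and '='. This is the field name.
   - Find the string between the quotes. This is the field value.
5. Increment the counter by 1 and repeat from #2 until the whole file has been read.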

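These commands may be helpful:

- To find a string enclosed by specific characters, use for example regexp(lineThatHasBeenRead,'/(.+)=','tokens','once')
- To build the output structure, use dynamic field names, for example output(ct).(fieldname) = value;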
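
As a minimal sketch of how these two commands combine, the snippet below parses a single qualifier line and stores it under a dynamic field name. The names sampleLine, nameTok, valueTok and record are hypothetical, and the pattern '/(\w+)=' is a slightly stricter variant of the one quoted above, used here on the assumption that field names contain only letters, digits and underscores.

```matlab
% One qualifier line, copied from the sample data in the question
sampleLine = '                     /locus_tag="MRA_0020"';

% The token between '/' and '=' is the field name
nameTok = regexp(sampleLine, '/(\w+)=', 'tokens', 'once');

% The token after '=', with optional surrounding quotes, is the value
valueTok = regexp(sampleLine, '="?([^"]*)"?\s*$', 'tokens', 'once');

% With 'tokens' and 'once', regexp returns a cell array of tokens,
% so the strings are extracted with {1}
ct = 1;
record = struct();
record(ct).(nameTok{1}) = valueTok{1};   % dynamic field name

disp(record(ct).locus_tag)
```

Because the field name is taken from the file, any qualifier that is not a valid MATLAB identifier would make the dynamic assignment fail, so it may be worth validating it first (for example with isvarname).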

Edit

Here is some code. I saved your example as "test.txt".

% open file
fid = fopen('test.txt');
if fid == -1
    error('Could not open test.txt');
end

% parse the file
eof = false;
geneCt = 1;
clear output % you cannot reassign output if it exists with different fieldnames already
output(1:1000) = struct; % you may want to initialize fields here
while ~eof
    % read lines till we find one with CDS
    foundCDS = false;
    while ~foundCDS && ~eof % stop at end of file, otherwise this loop never exits
        currentLine = fgetl(fid);
        % check for eof, then CDS. Requires whitespace at the beginning
        if ~ischar(currentLine) % fgetl returns -1 at end of file
            eof = true;
        elseif ~isempty(regexp(currentLine,'^\s+CDS','match','once'))
            foundCDS = true;
        end
    end % looking for CDS

    if ~eof
        % read (and remember) lines till we find 'gene'
        collectedLines = cell(1,20); % assume no more than 20 lines per gene. Row vector for looping below
        foundGene = false;
        lineCt = 1;
        while ~foundGene
            currentLine = fgetl(fid);
            % check for eof, then gene. Requires whitespace at the beginning
            if ~ischar(currentLine)
                % end of file - consider all data has been read
                eof = true;
                foundGene = true;
            elseif ~isempty(regexp(currentLine,'^\s+gene','match','once'))
                foundGene = true;
            else
                collectedLines{lineCt} = currentLine;
                lineCt = lineCt + 1;
            end
        end

        % loop through collectedLines and assign. Do not loop through the
        % gene line
        for line = collectedLines(1:lineCt-1)
            % \w+ keeps the field name from running into a value that contains '='
            fieldname = regexp(line{1},'/(\w+)=','tokens','once');
            value = regexp(line{1},'="?([^"]+)"?\s*$','tokens','once');
            % skip lines that are not of the form /name=value
            if isempty(fieldname) || isempty(value)
                continue
            end
            % try converting value to number
            numValue = str2double(value);
            if isfinite(numValue)
                value = numValue;
            else
                value = value{1};
            end
            output(geneCt).(fieldname{1}) = value;
        end
        geneCt = geneCt + 1;
    end
end % while eof

% cleanup
fclose(fid);
output(geneCt:end) = [];
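
If a plain list with one item per gene is all that is needed, a different technique from the line-by-line loop may be enough: read the whole text at once and cut it at every line that starts with 'gene'. The sketch below is only an illustration under assumptions. The sample is an abridged copy of the question's data built in memory so the snippet is self-contained, and the names sampleLines, txt, starts, stops and items are hypothetical. For the real file, txt would instead come from fileread('test.txt').

```matlab
% Abridged copy of the sample data (three gene records)
sampleLines = { ...
    '     gene            complement(22995..24539)', ...
    '                     /gene="ppp"', ...
    '                     /locus_tag="MRA_0020"', ...
    '     CDS             complement(22995..24539)', ...
    '                     /locus_tag="MRA_0020"', ...
    '     gene            complement(24628..25095)', ...
    '                     /locus_tag="MRA_0021"', ...
    '     CDS             complement(24628..25095)', ...
    '                     /locus_tag="MRA_0021"', ...
    '     gene            complement(25219..26802)', ...
    '                     /locus_tag="MRA_0022"', ...
    '     CDS             complement(25219..26802)', ...
    '                     /locus_tag="MRA_0022"'};
txt = strjoin(sampleLines, char(10));   % char(10) is the newline character

% Start index of every line that begins with optional blanks, then 'gene'.
% 'lineanchors' makes ^ match at the start of each line, so the
% qualifier lines '/gene="ppp"' are not picked up.
starts = regexp(txt, '^[ \t]*gene\s', 'start', 'lineanchors');

if isempty(starts)
    items = {};
else
    % Each item runs up to the character before the next gene line
    stops = [starts(2:end) - 1, numel(txt)];
    items = arrayfun(@(s, e) strtrim(txt(s:e)), starts, stops, ...
        'UniformOutput', false);
end

fprintf('%d items found\n', numel(items));
```

Each cell of items then holds the raw text of one gene record, which can be parsed further with the regexp patterns shown earlier.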

